期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Cluster analysis using different correlation coefficients

Seong S. Chae Chansoo Kim Jong-Min Kim William D. Warde 《Statistical Papers》2008,49(4):715-727

Partitioning objects into closely related groups that have different states allows to understand the underlying structure in the data set treated. Different kinds of similarity measure with clustering algorithms are commonly used to find an optimal clustering or closely akin to original clustering. Using shrinkage-based and rank-based correlation coefficients, which are known to be robust, the recovery level of six chosen clustering algorithms is evaluated using Rand’s C values. The recovery levels using weighted likelihood estimate of correlation coefficient are obtained and compared to the results from using those correlation coefficients in applying agglomerative clustering algorithms. This work was supported by RIC(R) grants from Traditional and Bio-Medical Research Center, Daejeon University (RRC04713, 2005) by ITEP in Republic of Korea. 相似文献

2.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《统计学通讯:模拟与计算》2019,48(6):1595-1605

One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data. 相似文献

3.

A novel fast heuristic to handle large-scale shape clustering

《Journal of Statistical Computation and Simulation》2012,82(1):160-169

Clustering algorithms like types of k-means are fast, but they are inefficient for shape clustering. There are some algorithms, which are effective, but their time complexities are too high. This paper proposes a novel heuristic to solve large-scale shape clustering. The proposed method is effective and it solves large-scale clustering problems in fraction of a second. 相似文献

4.

Clustering large number of extragalactic spectra of galaxies and quasars through canopies

Tuli De Didier Fraix Burnet Asis Kumar Chattopadhyay 《统计学通讯:理论与方法》2013,42(9):2638-2653

Abstract

Cluster analysis is the distribution of objects into different groups or more precisely the partitioning of a data set into subsets (clusters) so that the data in subsets share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects into different clusters. Solution of such problems for a large number of high dimensional data points is quite complicated and most of the existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data set has been used to cluster the spectra of 702248 galaxies and quasars having 1,540 points in wavelength range imposed by the instrument. The proposed technique has successfully discovered five clusters from this 702,248X1,540 data matrix. 相似文献

5.

Clustering objects on subsets of attributes (with discussion)

Jerome H. Friedman Jacqueline J. Meulman 《Journal of the Royal Statistical Society. Series B, Statistical methodology》2004,66(4):815-849

Summary. A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms this procedure encourages those algorithms to detect automatically subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity for detecting especially low cardinality groups clustering on a small subset of variables are discussed. Applications in different domains, including gene expression arrays, are presented. 相似文献

6.

Clustering of Variables Based on Watson Distribution on Hypersphere: A Comparison of Algorithms

Adelaide Figueiredo Paulo Gomes 《统计学通讯:模拟与计算》2015,44(10):2622-2635

We consider n individuals described by p variables, represented by points of the surface of unit hypersphere. We suppose that the individuals are fixed and the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained in these algorithms, using measures of within-groups variability and between-groups variability and compare these clusters with those obtained in other clustering approaches, by analyzing simulated and real data. 相似文献

7.

Comparison of clustering algorithms on generalized propensity score in observational studies: a simulation study

Chunhao Tu Shuo Jiao Woon Yuen Koh 《Journal of Statistical Computation and Simulation》2013,83(12):2206-2218

In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences on the estimation of treatment effects. Recently, generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique to apply the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on transformed GPS. We compare four popular clustering algorithms: k-means clustering (KMC), model-based clustering, fuzzy c-means clustering and partitioning around medoids based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index and average silhouette width under four various covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance compared with the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS. 相似文献

8.

Cluster Data Streams with Noisy Variables

Hu Yang Chenqun Yu 《统计学通讯:模拟与计算》2016,45(4):1381-1396

Clustering algorithms are important methods widely used in mining data streams because of their abilities to deal with infinite data flows. Although these algorithms perform well to mining latent relationship in data streams, most of them suffer from loss of cluster purity and become unstable when the inputting data streams have too many noisy variables. In this article, we propose a clustering algorithm to cluster data streams with noisy variables. The result from simulation shows that our proposal method is better than previous studies by adding a process of variable selection as a component in clustering algorithms. The results of two experiments indicate that clustering data streams with the process of variable selection are more stable and have better purity than those without such process. Another experiment testing KDD-CUP99 dataset also shows that our algorithm can generate more stable result. 相似文献

9.

Self-updating clustering algorithm for estimating the parameters in mixtures of von Mises distributions

Wen-Liang Hung Shou-Jen Chang-Chien Miin-Shen Yang 《Journal of applied statistics》2012,39(10):2259-2274

The EM algorithm is the standard method for estimating the parameters in finite mixture models. Yang and Pan [25] proposed a generalized classification maximum likelihood procedure, called the fuzzy c-directions (FCD) clustering algorithm, for estimating the parameters in mixtures of von Mises distributions. Two main drawbacks of the EM algorithm are its slow convergence and the dependence of the solution on the initial value used. The choice of initial values is of great importance in the algorithm-based literature as it can heavily influence the speed of convergence of the algorithm and its ability to locate the global maximum. On the other hand, the algorithmic frameworks of EM and FCD are closely related. Therefore, the drawbacks of FCD are the same as those of the EM algorithm. To resolve these problems, this paper proposes another clustering algorithm, which can self-organize local optimal cluster numbers without using cluster validity functions. These numerical results clearly indicate that the proposed algorithm is superior in performance of EM and FCD algorithms. Finally, we apply the proposed algorithm to two real data sets. 相似文献

10.

Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

Dongmeng Liu Jinko Graham 《The American statistician》2019,73(1):70-79

We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as soft analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises. 相似文献

11.

A note on spectral estimation for ARMA processes

《Journal of Statistical Computation and Simulation》2012,82(2-3):87-92

The standard frequency domain approximation to the Gaussian likelihood of a sample from an ARMA process is considered. The Newton-Raphson and Gauss-Newton numerical maximisation algorithms are evaluated for this approximate likelihood and the relationships between these algorithms and those of Akaike and Hannan explored. In particular it is shown that Hannan's method has certain computational advantages compared to the other spectral estimation methods considered 相似文献

12.

Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples

Susmita Datta Lara M. DePadilla 《Statistical Methodology》2006,3(1):79

This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm–feature selection tool–cutoff criteria combination on the performance as measured by an appropriate error rate measure. 相似文献

13.

An unsupervised, ensemble clustering algorithm: A new approach for classification of X-ray sources

S.M. Hojnacki G. Micela S.M. LaLonde E.D. Feigelson J.H. Kastner 《Statistical Methodology》2008,5(4):350-360

A large volume of CCD X-ray spectra is being generated by the Chandra X-ray Observatory (Chandra) and XMM-Newton. Automated spectral analysis and classification methods can aid in sorting, characterizing, and classifying this large volume of CCD X-ray spectra in a non-parametric fashion, complementary to current parametric model fits. We have developed an algorithm that uses multivariate statistical techniques, including an ensemble clustering method, applied for the first time for X-ray spectral classification. The algorithm uses spectral data to group similar discrete sources of X-ray emission by placing the X-ray sources in a three-dimensional spectral sequence and then grouping the ordered sources into clusters based on their spectra. This new method can handle large quantities of data and operate independently of the requirement of spectral source models and a priori knowledge concerning the nature of the sources (i.e., young stars, interacting binaries, active galactic nuclei). We apply the method to Chandra imaging spectroscopy of the young stellar clusters in the Orion Nebula Cluster and the NGC 1333 star formation region. 相似文献

14.

Clustering transformed compositional data using K-means,with applications in gene expression and bicycle sharing system data

Antoine Godichon-Baggioni Cathy Maugis-Rabusseau Andrea Rau 《Journal of applied statistics》2019,46(1):47-65

Although there is no shortage of clustering algorithms proposed in the literature, the question of the most relevant strategy for clustering compositional data (i.e. data whose rows belong to the simplex) remains largely unexplored in cases where the observed value is equal or close to zero for one or more samples. This work is motivated by the analysis of two applications, both focused on the categorization of compositional profiles: (1) identifying groups of co-expressed genes from high-throughput RNA sequencing data, in which a given gene may be completely silent in one or more experimental conditions; and (2) finding patterns in the usage of stations over the course of one week in the Velib' bicycle sharing system in Paris, France. For both of these applications, we make use of appropriately chosen data transformations, including the Centered Log Ratio and a novel extension called the Log Centered Log Ratio, in conjunction with the K-means algorithm. We use a non-asymptotic penalized criterion, whose penalty is calibrated with the slope heuristics, to select the number of clusters. Finally, we illustrate the performance of this clustering strategy, which is implemented in the Bioconductor package coseq, on both the gene expression and bicycle sharing system data. 相似文献

15.

Clustering probability distributions

Tai Vo Van 《Journal of applied statistics》2010,37(11):1891-1910

This article presents some theoretical results on the maximum of several functions, and its use to define the joint distance of k probability densities, which, in turn, serves to derive new algorithms for clustering densities. Numerical examples are presented to illustrate the theory. 相似文献

16.

Constrained clustering of irregularly sampled spatial data

《Journal of Statistical Computation and Simulation》2012,82(12):853-865

Many spatial data such as those in climatology or environmental monitoring are collected over irregular geographical locations. Furthermore, it is common to have multivariate observations at each location. We propose a method of segmentation of a region of interest based on such data that can be carried out in two steps: (1) clustering or classification of irregularly sample points and (2) segmentation of the region based on the classified points.

We develop a spatially-constrained clustering algorithm for segmentation of the sample points by incorporating a geographical-constraint into the standard clustering methods. Both hierarchical and nonhierarchical methods are considered. The latter is a modification of the seeded region growing method known in image analysis. Both algorithms work on a suitable neighbourhood structure, which can for example be defined by the Delaunay triangulation of the sample points. The number of clusters is estimated by testing the significance of successive change in the within-cluster sum-of-squares relative to a null permutation distribution. The methodology is validated on simulated data and used in construction of a climatology map of Ireland based on meteorological data of daily rainfall records from 1294 stations over the period of 37 years. 相似文献

17.

Impact of Contamination on Training and Test Error Rates in Statistical Clustering

C. Ruwet G. Haesbroeck 《统计学通讯:模拟与计算》2013,42(3):394-411

The k-means algorithm is one of the most common non hierarchical methods of clustering. It aims to construct clusters in order to minimize the within cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates. 相似文献

18.

Modular-transform based clustering

Gang Wang Jun Wang Mingyu Wang 《Journal of applied statistics》2013,40(12):2749-2759

Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient to solve binary clustering problems. When applied to multi-way clustering, either the binary spectral clustering is recursively applied or an embedding to spectral space is done and some other methods, such as K-means clustering, are used to cluster the points. Here we propose and study a K-way clustering algorithm – spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation, which has a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals a cluster structure clearly, then we apply linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide and conquer method. To get the overall clustering results, we apply the cluster assignment obtained in the previous step as the initialization of multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations. 相似文献

19.

基于密度的面板数据聚类分析

杨娟谢远涛《统计与信息论坛》2014,(2):23-28

研究面板数据聚类问题过程中,在相似性度量上,用Logistic回归模型构造相似系数和非对称相似矩阵。在聚类算法上,目前的聚类算法只适用于对称的相似矩阵。在非对称相似矩阵的聚类算法上,采用最佳优先搜索和轮廓系数,改进DBSCAN聚类方法,提出BF—DBSCAN方法。通过实例分析,比较了BF—DBSCAN和DBSCAN方法的聚类结果,以及不同参数设置对BF—DBSCAN聚类结果的影响,验证了该方法的有效性和实用性。相似文献

20.

Optimized sampling by exchange methods

Thomas Bausch 《Statistical Papers》1988,29(1):205-218

A market research case study called for the elaboration of a database typology concerning about 3 million clients. When using clustering algorithms the selection of a relatively small multivariate sample (1000) is necessary. As the most common sampling techniques aim at univariate cases, criteria of goodness for multivariate samples shall be defined. It will be shown that in most of all cases the construction of multivariate criteria leads to insurmountable problems. For that reason a set of univariate criteria will be designed and followed by the definition of goal functions. These goal functions are based on the aggregation of univariate quality judgements and will be optimized by two different processes. As cluster analysis—like sampling—bear similar combinatorial optimization problems, exchanging heuristics are used in the optimization process. Eventually two sample surveys of a different structure will be presented, the results of which are based on parametric criteria within the case study. 相似文献