
1.

A tutorial on spectral clustering (total citations: 33; self-citations: 0; citations by others: 33)

Ulrike von Luxburg 《Statistics and Computing》2007,17(4):395-416

In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
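As a rough illustration of the idea summarized in this abstract, the sketch below builds an unnormalized graph Laplacian and splits the data by the sign of its second eigenvector. It is a simplified two-cluster variant, not the tutorial's own algorithm (which runs k-means on the first k eigenvectors). The function name `spectral_bipartition`, the Gaussian similarity, and the bandwidth `sigma` are assumptions made for this example.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split points into two groups using the unnormalized graph Laplacian."""
    # Fully connected similarity graph with Gaussian weights (assumed choice).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # eigh returns eigenvalues in ascending order; column 1 is the eigenvector
    # of the second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]
    # Simplification: threshold at zero instead of running k-means.
    return (fiedler > 0).astype(int)

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=3.0, scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])
labels = spectral_bipartition(X)
```

Which blob receives label 0 or 1 is arbitrary, because an eigenvector is only defined up to sign.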

2.

Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

Dongmeng Liu Jinko Graham 《The American statistician》2019,73(1):70-79

We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.
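For context, the classic silhouette width that this abstract takes as its starting point can be computed from a dissimilarity matrix and a hard partition as sketched below. This shows only the standard silhouette, s(i) = (b_i - a_i) / max(a_i, b_i), and not the authors' proposed probability-like measures. The function name `silhouette_widths` and the toy data are assumptions.

```python
import numpy as np

def silhouette_widths(D, labels):
    """Classic silhouette width s(i) = (b_i - a_i) / max(a_i, b_i) per point."""
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: silhouette is taken to be 0
        # a: mean dissimilarity to the other members of the point's cluster.
        a = D[i, same].mean()
        # b: smallest mean dissimilarity to any other cluster.
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy one-dimensional data with two tight, well-separated groups (hypothetical).
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(points[:, None] - points[None, :])
labels = [0, 0, 0, 1, 1, 1]
widths = silhouette_widths(D, labels)
```

Values close to 1 indicate points that sit firmly inside their cluster. The widths range over [-1, 1], which is why they cannot be read directly as probabilities.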

3.

Comparison of clustering algorithms on generalized propensity score in observational studies: a simulation study

Chunhao Tu Shuo Jiao Woon Yuen Koh 《Journal of Statistical Computation and Simulation》2013,83(12):2206-2218

In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences on the estimation of treatment effects. Recently, the generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique to apply the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on transformed GPS. We compare four popular clustering algorithms: k-means clustering (KMC), model-based clustering, fuzzy c-means clustering and partitioning around medoids, based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index and average silhouette width, under four different covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance compared with the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS.

4.

Statistics and recognition for software birthmark based on clustering analysis

YangXia Luo 《Journal of applied statistics》2017,44(2):308-324

The result of feature selection for software birthmarks has a direct bearing on the software recognition rate. In this paper, we apply constrained clustering to analyze the software features (SF). The within-class (homogeneous software) and between-class (heterogeneous software) distances of features are measured based on mutual information. Information gain functions and penalty functions are constructed using homogeneous and heterogeneous SF, respectively, and redundancy is measured with correlation coefficients. Then the software birthmark features with high class distinction and minimum redundancy are selected. An example of a framework for extracting and detecting birthmark features is also given. The algorithm is analyzed and compared with similar algorithms, and it is shown that the algorithm provides an effective approach for software birthmark selection and optimization.

5.

Clustering of Variables Around Latent Components (total citations: 1; self-citations: 0; citations by others: 1)

《Communications in Statistics - Simulation and Computation》2013,42(4):1131-1150

Abstract

Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation shows high disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables by taking external data into account. The strategy basically consists of performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion, which reflects the extent to which variables in each cluster are related to the latent variable associated with this cluster. Illustrations are outlined using real data sets from sensory studies.

6.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《Communications in Statistics - Simulation and Computation》2019,48(6):1595-1605

One of the most popular algorithms for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on basic conditions such as the existence of a mean and a finite variance, it is unsuitable for data whose variance is infinite, such as data from a heavy-tailed distribution. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, a new distance and clustering algorithm for heavy-tailed data is developed using PMC, based on k-means clustering.

7.

Self-updating clustering algorithm for estimating the parameters in mixtures of von Mises distributions

Wen-Liang Hung Shou-Jen Chang-Chien Miin-Shen Yang 《Journal of applied statistics》2012,39(10):2259-2274

The EM algorithm is the standard method for estimating the parameters in finite mixture models. Yang and Pan [25] proposed a generalized classification maximum likelihood procedure, called the fuzzy c-directions (FCD) clustering algorithm, for estimating the parameters in mixtures of von Mises distributions. Two main drawbacks of the EM algorithm are its slow convergence and the dependence of the solution on the initial value used. The choice of initial values is of great importance in the algorithm-based literature as it can heavily influence the speed of convergence of the algorithm and its ability to locate the global maximum. On the other hand, the algorithmic frameworks of EM and FCD are closely related. Therefore, the drawbacks of FCD are the same as those of the EM algorithm. To resolve these problems, this paper proposes another clustering algorithm, which can self-organize local optimal cluster numbers without using cluster validity functions. The numerical results clearly indicate that the proposed algorithm is superior in performance to the EM and FCD algorithms. Finally, we apply the proposed algorithm to two real data sets.

8.

Unsupervised Curve Clustering using B-Splines (total citations: 5; self-citations: 0; citations by others: 5)

C. Abraham P. A. Cornillon E. Matzner-Løber N. Molinari 《Scandinavian Journal of Statistics》2003,30(3):581-595

Data in many different fields come to practitioners through a process naturally described as functional. Although data are gathered as finite vectors and may contain measurement errors, the functional form has to be taken into account. We propose a clustering procedure for such data emphasizing the functional nature of the objects. The new clustering method consists of two stages: fitting the functional data by B‐splines and partitioning the estimated model coefficients using a k‐means algorithm. Strong consistency of the clustering method is proved and a real‐world example from the food industry is given.

9.

Clustering of Variables Based on Watson Distribution on Hypersphere: A Comparison of Algorithms

Adelaide Figueiredo Paulo Gomes 《Communications in Statistics - Simulation and Computation》2015,44(10):2622-2635

We consider n individuals described by p variables, represented by points on the surface of the unit hypersphere. We suppose that the individuals are fixed and the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained with these algorithms, using measures of within-groups variability and between-groups variability, and to compare these clusters with those obtained by other clustering approaches, by analyzing simulated and real data.

10.

A novel fast heuristic to handle large-scale shape clustering

《Journal of Statistical Computation and Simulation》2012,82(1):160-169

Clustering algorithms such as variants of k-means are fast, but they are inefficient for shape clustering. There are some algorithms that are effective, but their time complexities are too high. This paper proposes a novel heuristic to solve large-scale shape clustering. The proposed method is effective and solves large-scale clustering problems in a fraction of a second.

11.

Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves

Tarpey T 《The American statistician》2007,61(1):34-40

Functional data can be clustered by plugging estimated regression coefficients from individual curves into the k-means algorithm. Clustering results can differ depending on how the curves are fit to the data. Estimating curves using different sets of basis functions corresponds to different linear transformations of the data. k-means clustering is not invariant to linear transformations of the data. The optimal linear transformation for clustering will stretch the distribution so that the primary direction of variability aligns with actual differences in the clusters. It is shown that clustering the raw data will often give results similar to clustering regression coefficients obtained using an orthogonal design matrix. Clustering functional data using an L² metric on function space can be achieved by clustering a suitable linear transformation of the regression coefficients. An example where depressed individuals are treated with an antidepressant is used for illustration.
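The claim that k-means is not invariant to linear transformations can be checked on a toy example. The sketch below does not run the usual iterative k-means procedure. Instead it finds, by brute force, the two-cluster partition that minimizes the k-means criterion (within-cluster sum of squares), so the result does not depend on initialization. The four data points, the scaling matrix `A`, and the function names are assumptions chosen for illustration.

```python
import numpy as np
from itertools import combinations

def wcss(X, labels):
    """Within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def best_two_partition(X):
    """Exhaustively find the 2-cluster partition minimizing the k-means criterion."""
    n = len(X)
    best_labels, best_cost = None, np.inf
    # Subsets up to size n // 2 cover every split into two non-empty clusters.
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            labels = np.zeros(n, dtype=int)
            labels[list(subset)] = 1
            cost = wcss(X, labels)
            if cost < best_cost:
                best_labels, best_cost = labels, cost
    return best_labels

# Corners of a 1 x 3 rectangle: the long side runs along the second coordinate.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [1.0, 3.0]])
# Linear transformation that shrinks the second coordinate by a factor of 10.
A = np.diag([1.0, 0.1])

raw = best_two_partition(X)         # groups points {0, 1} against {2, 3}
scaled = best_two_partition(X @ A)  # groups points {0, 2} against {1, 3}
```

On the raw data the optimal split separates the points along the second coordinate. After rescaling, the first coordinate carries most of the variability and the optimal split changes accordingly.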

12.

Feature selection and machine learning with mass spectrometry data for distinguishing cancer and non-cancer samples

Susmita Datta Lara M. DePadilla 《Statistical Methodology》2006,3(1):79

This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type 1 error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm–feature selection tool–cutoff criteria combination on the performance as measured by an appropriate error rate measure.

13.

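Density-Based Cluster Analysis of Panel Data

Yang Juan Xie Yuantao 《Statistics and Information Forum》2014,(2):23-28

In studying the clustering of panel data, a logistic regression model is used for the similarity measure, to construct similarity coefficients and an asymmetric similarity matrix. Existing clustering algorithms apply only to symmetric similarity matrices. To cluster with an asymmetric similarity matrix, best-first search and the silhouette coefficient are used to improve the DBSCAN clustering method, yielding the proposed BF-DBSCAN method. A worked example compares the clustering results of BF-DBSCAN and DBSCAN, as well as the effect of different parameter settings on the BF-DBSCAN results, and verifies the effectiveness and practicality of the method.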

14.

Analytical proofs of classical inequalities between Spearman's ρ and Kendall's τ

Christian Genest Johanna Nešlehová 《Journal of statistical planning and inference》2009,139(11):3795

Short analytical proofs are given for classical inequalities due to Daniels [1950. Rank correlation and population models. J. Roy. Statist. Soc. Ser. B 12, 171–181; 1951. Note on Durbin and Stuart's formula for E(r_s). J. Roy. Statist. Soc. Ser. B 13, 310] and Durbin and Stuart [1951. Inversions and rank correlation coefficients. J. Roy. Statist. Soc. Ser. B 13, 303–309] relating Spearman's ρ and Kendall's τ.
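The abstract does not state the inequalities themselves. For orientation, they are usually written in the following form for the population versions of ρ and τ. This is the standard textbook statement, given here as background and not quoted from the paper.

```latex
% Daniels' inequality
-1 \;\le\; 3\tau - 2\rho \;\le\; 1

% Durbin-Stuart inequalities
\frac{1+\rho}{2} \;\ge\; \left(\frac{1+\tau}{2}\right)^{2},
\qquad
\frac{1-\rho}{2} \;\ge\; \left(\frac{1-\tau}{2}\right)^{2}
```

As a quick check of what these imply: for τ = 0, both Daniels' inequality and the Durbin-Stuart inequalities give -1/2 ≤ ρ ≤ 1/2.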

15.

Four Correlation Coefficients with a Third Blocking Variable: Their Efficacy, Relative Efficiency, and Test Statistics

《Communications in Statistics - Theory and Methods》2013,42(9):1835-1858

Abstract

The efficacy and the asymptotic relative efficiency (ARE) of a weighted sum of Kendall's taus, a weighted sum of Spearman's rhos, a weighted sum of Pearson's r's, and a weighted sum of z-transformation of the Fisher–Yates correlation coefficients, in the presence of a blocking variable, are discussed. The method of selecting the weighting constants that maximize the efficacy of these four correlation coefficients is proposed. The estimate, test statistics and confidence interval of the four correlation coefficients with weights are also developed. To compare the small-sample properties of the four tests, a simulation study is performed. The theoretical and simulated results all prefer the weighted sum of the Pearson correlation coefficients with the optimal weights, as well as the weighted sum of z-transformation of the Fisher–Yates correlation coefficients with the optimal weights.

16.

Model‐based clustering of longitudinal data

Paul D. McNicholas T. Brendan Murphy 《Revue canadienne de statistique》2010,38(1):153-168

A new family of mixture models for the model‐based clustering of longitudinal data is introduced. The covariance structures of eight members of this new family of models are given and the associated maximum likelihood estimates for the parameters are derived via expectation–maximization (EM) algorithms. The Bayesian information criterion is used for model selection and a convergence criterion based on the Aitken acceleration is used to determine the convergence of these EM algorithms. This new family of models is applied to yeast sporulation time course data, where the models give good clustering performance. Further constraints are then imposed on the decomposition to allow a deeper investigation of the correlation structure of the yeast data. These constraints greatly extend this new family of models, with the addition of many parsimonious models. The Canadian Journal of Statistics 38:153–168; 2010 © 2010 Statistical Society of Canada

17.

Clustering large number of extragalactic spectra of galaxies and quasars through canopies

Tuli De Didier Fraix-Burnet Asis Kumar Chattopadhyay 《Communications in Statistics - Theory and Methods》2013,42(9):2638-2653

Abstract

Cluster analysis is the distribution of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects to different clusters. Solving such problems for a large number of high-dimensional data points is quite complicated, and most of the existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data sets has been used to cluster the spectra of 702,248 galaxies and quasars having 1,540 points in the wavelength range imposed by the instrument. The proposed technique has successfully discovered five clusters from this 702,248 × 1,540 data matrix.

18.

Hierarchical Variable Selection in Polynomial Regression Models

Julio L. Peixoto 《The American statistician》2013,67(4):311-313

Significance tests on coefficients of lower-order terms in polynomial regression models are affected by linear transformations. For this reason, a polynomial regression model that excludes hierarchically inferior predictors (i.e., lower-order terms) is considered to be not well formulated. Existing variable-selection algorithms do not take into account the hierarchy of predictors and often select as “best” a model that is not hierarchically well formulated. This article proposes a theory of the hierarchical ordering of the predictors of an arbitrary polynomial regression model in m variables, where m is any arbitrary positive integer. Ways of modifying existing algorithms to restrict their search to well-formulated models are suggested. An algorithm that generates all possible well-formulated models is presented.

19.

On the computation of the noncentral F and noncentral beta distribution

Ali Baharev Sándor Kemény 《Statistics and Computing》2008,18(3):333-340

Unfortunately, many of the numerous algorithms for computing the cumulative distribution function (cdf) and noncentrality parameter of the noncentral F and beta distributions can produce completely incorrect results, as demonstrated in the paper by examples. Existing algorithms are scrutinized and those parts that involve numerical difficulties are identified. As a result, a pseudo code is presented in which all the known numerical problems are resolved. This pseudo code can be easily implemented in programming language C or FORTRAN without understanding the complicated mathematical background. Symbolic evaluation of a finite and closed formula is proposed to compute exact cdf values. This approach makes it possible to check quickly and reliably the values returned by professional statistical packages over an extraordinarily wide parameter range without any programming knowledge. This research was motivated by the fact that a very useful table for calculating the size of detectable effects for ANOVA tables contains suspect values in the region of large noncentrality parameter values compared to the values obtained by Patnaik’s 2-moment central-F approximation. The cause is identified and the corrected form of the table for ANOVA purposes is given. The accuracy of the approximations to the noncentral-F distribution is also discussed. The authors wish to thank Mr. Richárd Király for his preliminary work. The authors are grateful to the Editor and Associate Editor of STCO and the unknown reviewers for their helpful suggestions.

20.

Numerical computation of rectangular bivariate and trivariate normal and t probabilities (total citations: 2; self-citations: 0; citations by others: 2)

Alan Genz 《Statistics and Computing》2004,14(3):251-260

Algorithms for the computation of bivariate and trivariate normal and t probabilities for rectangles are reviewed. The algorithms use numerical integration to approximate transformed probability distribution integrals. A generalization of Plackett's formula is derived for bivariate and trivariate t probabilities. New methods are described for the numerical computation of bivariate and trivariate t probabilities. Test results are provided, along with recommendations for the most efficient algorithms for single and double precision computations.