Similar Articles
1.
A tutorial on spectral clustering
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on these questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
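The following is a minimal sketch of the normalized spectral clustering recipe summarized above (similarity graph, normalized Laplacian, k-means on the leading eigenvectors). The fully connected Gaussian-kernel similarity graph and the bandwidth `sigma` are illustrative assumptions, not choices prescribed by the tutorial.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Normalized spectral clustering on the rows of X (a sketch)."""
    # Fully connected similarity graph with a Gaussian kernel (illustrative choice).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # Rows of the first k eigenvectors form the spectral embedding.
    _, U = eigh(L_sym, subset_by_index=[0, k - 1])
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # k-means on the embedded points gives the final partition.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

labels = spectral_clustering(np.random.default_rng(0).normal(size=(60, 2)), k=3)
```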

2.
Extremely large-scale genetic data currently present significant challenges for cluster analysis. Most existing clustering methods are built on the Euclidean distance and geared toward analyzing continuous responses. They work well for clustering, for example, microarray gene expression data, but often perform poorly for clustering large-scale single nucleotide polymorphism (SNP) data. In this paper, we study the penalized latent class model for clustering extremely large-scale discrete data. The penalized latent class model takes into account the discrete nature of the response using appropriate generalized linear models and adopts the lasso penalized likelihood approach for simultaneous model estimation and selection of important covariates. We develop very efficient numerical algorithms for model estimation based on the iterative coordinate descent approach and further develop the expectation–maximization algorithm to incorporate and model missing values. We use simulation studies and applications to the international HapMap SNP data to illustrate the competitive performance of the penalized latent class model.

3.
In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences in the estimation of treatment effects. Recently, the generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique for applying the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on the transformed GPS. We compare four popular clustering algorithms: k-means clustering (KMC), model-based clustering, fuzzy c-means clustering and partitioning around medoids, based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index and average silhouette width, under four different covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance compared with the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS.
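A brief sketch of the recommended workflow (k-means on a transformed GPS, evaluated with two of the criteria named above). The logit transformation, the simulated scores, and the number of clusters are assumptions made for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
gps = rng.uniform(0.05, 0.95, size=(500, 3))   # stand-in for estimated generalized propensity scores
gps_t = np.log(gps / (1.0 - gps))              # logit transform as one plausible "transformed GPS"

labels = KMeans(n_clusters=5, n_init=20, random_state=0).fit_predict(gps_t)

# Two of the evaluation criteria mentioned in the abstract.
avg_within_dissim = np.mean([pdist(gps_t[labels == c]).mean() for c in np.unique(labels)])
print("average within-cluster dissimilarity:", round(avg_within_dissim, 3))
print("average silhouette width:", round(silhouette_score(gps_t, labels), 3))
```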

4.
Partitioning objects into closely related groups with different states makes it possible to understand the underlying structure of the data set at hand. Different kinds of similarity measure, combined with clustering algorithms, are commonly used to find an optimal clustering or one closely akin to the original clustering. Using shrinkage-based and rank-based correlation coefficients, which are known to be robust, the recovery level of six chosen clustering algorithms is evaluated using Rand's C values. The recovery levels obtained using the weighted likelihood estimate of the correlation coefficient are compared with the results from using those correlation coefficients in agglomerative clustering algorithms. This work was supported by RIC(R) grants from the Traditional and Bio-Medical Research Center, Daejeon University (RRC04713, 2005), funded by ITEP in the Republic of Korea.
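A sketch of the evaluation idea under stated assumptions: Spearman's rho stands in for the rank-based correlation coefficient, average-linkage agglomerative clustering is used, and the adjusted Rand index stands in for Rand's C as the recovery score.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
true_labels = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 30)) + true_labels[:, None] * 1.5   # toy data with three original groups

# Rank-based (Spearman) correlation between objects, turned into a dissimilarity.
rho, _ = spearmanr(X.T)            # 60 x 60 correlation matrix between the rows of X
dist = 1.0 - rho
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering on the correlation-based dissimilarity.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Recovery of the original grouping (adjusted Rand index as a stand-in for Rand's C).
print("recovery:", round(adjusted_rand_score(true_labels, labels), 3))
```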

5.
We consider n individuals described by p variables, represented by points on the surface of the unit hypersphere. We suppose that the individuals are fixed and that the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use the EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained by these algorithms, using measures of within-group and between-group variability, and to compare them with the clusters obtained by other clustering approaches, by analyzing simulated and real data.

6.
One of the most popular methods for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on basic conditions such as the existence of a mean and a finite variance, it is unsuitable for data whose variance is infinite, such as data with heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, using PMC and building on k-means clustering, a new distance and clustering algorithm is developed for heavy-tailed data.

7.
We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample, such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures for individuals with ambiguous cluster membership, using simulated binary datasets partitioned by the PAM algorithm and continuous datasets partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic dataset on irises.
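A plausible construction in the spirit of the second measure, not the authors' exact definition: each individual's certainty for a cluster is taken as an inverse average dissimilarity to that cluster, normalized across clusters so the values behave like probabilities.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

D = squareform(pdist(X))   # pairwise dissimilarities in the sample

def membership_certainty(D, labels):
    """Probability-like cluster-membership weights from pairwise dissimilarities.

    For each individual, the average dissimilarity to every cluster is inverted
    and normalized to sum to one -- a plausible construction, not the exact
    measures proposed in the paper.
    """
    clusters = np.unique(labels)
    n = D.shape[0]
    probs = np.zeros((n, len(clusters)))
    for j, c in enumerate(clusters):
        members = labels == c
        for i in range(n):
            others = members.copy()
            others[i] = False                         # exclude the individual itself
            probs[i, j] = 1.0 / (D[i, others].mean() + 1e-12)
    return probs / probs.sum(axis=1, keepdims=True)

P = membership_certainty(D, labels)
print(P[:3].round(3))   # certainty of the first three individuals for each cluster
```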

8.
The correspondence analysis (CA) method appears to be an effective tool for the analysis of interrelations between rows and columns in two-way contingency data. A discrete version of the method, box clustering, is developed in the paper using an approximate version of the CA model extended to the case in which the CA factor values are required to be Boolean. Several properties of the proposed SEFIT-BOX algorithm are proved to facilitate interpretation of its output. It is also shown that two known partitioning algorithms (applied within the row or column set only) can be considered locally optimal algorithms for fitting the model, and extensions of these algorithms to a simultaneous row and column partitioning problem are proposed.

9.
Reduced k-means clustering is a method for clustering objects in a low-dimensional subspace. The advantage of this method is that both the clustering of objects and the low-dimensional subspace reflecting the cluster structure are obtained simultaneously. In this paper, the relationship between conventional k-means clustering and reduced k-means clustering is discussed. Conditions ensuring almost sure convergence of the reduced k-means estimator as the sample size increases without bound are presented. Results for a more general model encompassing both conventional k-means clustering and reduced k-means clustering are also provided. Moreover, a consistent selection of the numbers of clusters and dimensions is described.

10.
In studying the clustering of panel data, a logistic regression model is used to construct similarity coefficients and an asymmetric similarity matrix as the similarity measure. On the algorithm side, existing clustering algorithms apply only to symmetric similarity matrices. For clustering with an asymmetric similarity matrix, best-first search and the silhouette coefficient are used to improve the DBSCAN clustering method, yielding the proposed BF-DBSCAN method. An empirical example compares the clustering results of BF-DBSCAN and DBSCAN and examines the influence of different parameter settings on the BF-DBSCAN results, verifying the effectiveness and practicality of the method.
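The paper's best-first-search modification is not reproduced here. The sketch below only mirrors the evaluation step: standard DBSCAN is run on a precomputed dissimilarity (obtained by symmetrizing a stand-in asymmetric similarity matrix, since scikit-learn's DBSCAN requires a symmetric precomputed metric), and the silhouette coefficient is used to compare parameter settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 4)), rng.normal(2, 0.3, (50, 4))])

# Stand-in for an asymmetric similarity matrix (e.g., one built from a logistic model),
# symmetrized and converted to a dissimilarity for DBSCAN's precomputed metric.
S_asym = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1)) * rng.uniform(0.9, 1.1, (100, 100))
D = np.clip(1.0 - (S_asym + S_asym.T) / 2.0, 0.0, None)
np.fill_diagonal(D, 0.0)

# Compare parameter settings via the silhouette coefficient, as in the paper's evaluation.
for eps in (0.3, 0.5, 0.7):
    labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit_predict(D)
    clustered = labels != -1                          # ignore noise points when scoring
    if len(set(labels[clustered])) > 1:
        score = silhouette_score(D[np.ix_(clustered, clustered)], labels[clustered], metric="precomputed")
        print(f"eps={eps}: clusters={len(set(labels[clustered]))}, silhouette={score:.3f}")
    else:
        print(f"eps={eps}: fewer than two clusters found")
```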

11.
Among the statistical methods for modelling the stochastic behaviour of objects, clustering is a preliminary technique for recognizing similar patterns within a group of observations in a data set. Various distances for measuring differences among objects can be invoked to cluster data through numerous clustering methods. When the variables at hand contain geometrical information about the objects, such metrics should be adapted accordingly. In fact, statistical methods for these typical data are endowed with a geometrical paradigm in a multivariate sense. In this paper, a procedure for clustering shape data is suggested employing appropriate metrics. The best candidate shape distance, as well as a suitable agglomerative method for clustering the simulated shape data, is then identified by means of cluster validation measures. The results are illustrated with a real-life application.
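One way such a procedure can be carried out, under stated assumptions: the Procrustes disparity from SciPy serves as a candidate shape distance, average linkage as the agglomerative method, and the silhouette width as the cluster validation measure; the toy landmark configurations are invented for illustration.

```python
import numpy as np
from scipy.spatial import procrustes
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
base_a = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 1.5]])       # two toy "shape" templates
base_b = np.array([[0, 0], [1, 0], [1.5, 0.8], [0.5, 1.6], [-0.5, 0.8]])
shapes = [base_a + rng.normal(0, 0.05, base_a.shape) for _ in range(15)] + \
         [base_b + rng.normal(0, 0.05, base_b.shape) for _ in range(15)]

# Pairwise shape distances: Procrustes disparity after optimal translation, scaling and rotation.
n = len(shapes)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        _, _, disparity = procrustes(shapes[i], shapes[j])
        D[i, j] = D[j, i] = np.sqrt(disparity)

# Agglomerative clustering on the shape distances, validated by the silhouette width.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print("silhouette:", round(silhouette_score(D, labels, metric="precomputed"), 3))
```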

12.
A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms, this procedure encourages those algorithms to automatically detect subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and can partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity in detecting especially low-cardinality groups that cluster on a small subset of variables are discussed. Applications in different domains, including gene expression arrays, are presented.

13.
K-means inverse regression was developed as an easy-to-use dimension reduction procedure for multivariate regression. This approach is similar to the original sliced inverse regression method, with the exception that the slices are explicitly produced by a K-means clustering of the response vectors. In this article, we propose K-medoids clustering as an alternative clustering approach for slicing and compare its performance to that of K-means in a simulation study. Although the two methods often produce comparable results, K-medoids tends to yield better performance in the presence of outliers. In addition to isolating outliers, K-medoids clustering has the advantage of accommodating a broader range of dissimilarity measures, which could prove useful in other graphical regression applications where slicing is required.

14.
This article presents an algebraic analysis of agglomerative clustering algorithms, which results in a graphic portrayal of these algorithms and a classification scheme based on the degree of distortion that the algorithms in each group impose on the object space.

15.
Based on the L1-distance and the representative element of a cluster, this article proposes three new algorithms for Fuzzy Clustering of probability density Functions (FCF): a hierarchical approach, a non-hierarchical approach, and an algorithm for determining the optimal number of clusters and the initial partition matrix so as to improve the quality of the clusters established by the non-hierarchical approach. With the proposed algorithms, FCF is more advantageous than non-fuzzy clustering of probability density functions. The algorithms are applied to recognizing images from the Texture and Corel databases and to a practical problem concerning the study and training marks of students at a university. Matlab programs are provided for the computations in the proposed algorithms; they are used not only to compute the numerical examples of this article efficiently but can also be applied to many different real problems.

16.
The aim of this study is to assign weights w_1, …, w_m to m clustering variables Z_1, …, Z_m, so that the k groups uncovered reveal more meaningful within-group coherence. We propose a new criterion to be minimized, namely the sum of the weighted within-cluster sums of squares plus a penalty for heterogeneity in the variable weights w_1, …, w_m. We present the computing algorithm for this k-means clustering, a working procedure for determining a suitable value of the penalty constant, and numerical examples, one simulated and two real.
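A sketch of the alternating scheme such a criterion suggests. The exact penalty used here (a quadratic penalty around equal weights, with a crude projection to keep the weights non-negative on the simplex) and the toy data are assumptions, not the criterion or procedure from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def penalized_weighted_kmeans(X, k, penalty=500.0, n_iter=20, seed=0):
    """Alternate between weighted k-means and a penalized variable-weight update.

    Criterion sketched here: sum_j w_j * WCSS_j + penalty * sum_j (w_j - 1/m)^2
    with w_j >= 0 and sum_j w_j = 1 -- an assumed form, not necessarily the
    paper's exact criterion.
    """
    n, m = X.shape
    w = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # Weighted k-means: equivalent to ordinary k-means on variables scaled by sqrt(w).
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X * np.sqrt(w))

        # Per-variable within-cluster sums of squares on the unscaled data.
        wcss = np.zeros(m)
        for c in range(k):
            Xc = X[labels == c]
            wcss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)

        # Closed-form minimizer of the penalized criterion under sum(w) = 1,
        # followed by a crude projection to keep the weights non-negative.
        w = 1.0 / m + (wcss.mean() - wcss) / (2.0 * penalty)
        w = np.clip(w, 0.0, None)
        w /= w.sum()
    return labels, w

rng = np.random.default_rng(7)
X = np.hstack([np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))]),  # informative
               rng.normal(0, 1.0, (100, 4))])                                           # uninformative
labels, weights = penalized_weighted_kmeans(X, k=2)
print(weights.round(3))   # the informative variables should receive larger weights
```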

17.
Unsupervised Curve Clustering using B-Splines
Data in many different fields come to practitioners through a process naturally described as functional. Although the data are gathered as finite vectors and may contain measurement errors, their functional form has to be taken into account. We propose a clustering procedure for such data that emphasizes the functional nature of the objects. The new clustering method consists of two stages: fitting the functional data by B-splines and partitioning the estimated model coefficients using a k-means algorithm. Strong consistency of the clustering method is proved and a real-world example from the food industry is given.
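A sketch of the two-stage method described above, under illustrative assumptions about the spline degree, knot placement, observation grid, and number of clusters.

```python
import numpy as np
from scipy.interpolate import splrep
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
t = np.linspace(0, 1, 50)                                    # common observation grid
curves = np.array(
    [np.sin(2 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(30)] +
    [np.cos(2 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(30)]
)

# Stage 1: fit every curve with the same cubic B-spline basis (shared interior knots),
# so the coefficient vectors are comparable across curves.
interior_knots = np.linspace(0.1, 0.9, 8)
coefs = np.array([splrep(t, y, k=3, t=interior_knots, task=-1)[1] for y in curves])

# Stage 2: k-means on the estimated basis coefficients.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)
print(labels)
```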

18.
We develop an index that effectively measures the level of social tension generated by income class segregation. We adopt the basic concepts of between-group difference (or alienation) and within-group similarity (or identification) from the income (bi)polarization literature, but we allow for asymmetric degrees of between-group antagonism in the alienation function and construct a more effective identification function using both the relative degree of within-group clustering and the group size. To facilitate statistical inference, we derive the asymptotic distribution of the proposed measure using results from U-statistics. As the new measure is general enough to include existing income polarization indices as well as the Gini index as special cases, the asymptotic result can be readily applied to these popular indices. Evidence from the Panel Study of Income Dynamics data suggests that, while the level of social tension shows an upward trend over the sample period of 1981 to 2005, government taxes and transfers have been effective in reducing it significantly.

19.
Cluster analysis is an important technique of explorative data mining. It refers to a collection of statistical methods for learning the structure of data by exploring only pairwise distances or similarities. Often, meaningful structures are not detectable in such high-dimensional feature spaces, because relevant features can be obfuscated by noise from irrelevant measurements. These observations led to the design of subspace clustering algorithms, which can identify clusters that originate from different subsets of features. Hunting for clusters in arbitrary subspaces is intractable due to the combinatorial search space spanned by all feature combinations. In this work, we present a subspace clustering algorithm that can exhaustively screen all feature combinations of small- or medium-sized datasets (approximately 30 features). Based on a robustness analysis via subsampling, we are able to identify a set of stable candidate subspace cluster solutions.
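A plausible reading of the screening idea, not the paper's exact algorithm: exhaustively enumerate feature subsets, cluster each on random subsamples, and score stability by the adjusted Rand index between the subsample solutions and the full-data solution. The clustering method, stability score, and subset size are assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subspace_stability(X, features, k=2, n_subsamples=10, frac=0.8, seed=0):
    """Stability of a k-means solution restricted to `features`, assessed via subsampling."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[:, features])
    scores = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        sub = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[idx][:, features])
        scores.append(adjusted_rand_score(full[idx], sub))
    return float(np.mean(scores))

rng = np.random.default_rng(9)
X = np.hstack([np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(5, 1, (60, 2))]),  # structured features 0, 1
               rng.normal(0, 1, (120, 3))])                                         # noise features 2-4

# Exhaustive screening of all 2-feature subspaces (feasible for small feature counts).
results = {fs: subspace_stability(X, list(fs)) for fs in combinations(range(X.shape[1]), 2)}
for fs, score in sorted(results.items(), key=lambda kv: -kv[1])[:3]:
    print(fs, round(score, 2))   # subspaces containing the structured features should rank as most stable
```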

20.
Communications in Statistics: Theory and Methods, 2012, 41(16-17): 3179-3197
Text clustering is an unsupervised process of classifying texts and words into different groups. In the literature, many algorithms use a bag-of-words model to represent texts and classify contents. The bag-of-words model assumes that word order has no significance. The aim of this article is to propose a new method of text clustering that considers the links between terms and documents. We use centrality measures to assess word/text importance in a corpus and to sequentially classify documents.
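A loose sketch of the sequential, centrality-driven idea described above, not the authors' algorithm: documents are linked by tf-idf cosine similarity, degree centrality selects the seed of each cluster, and sufficiently similar unassigned neighbours are absorbed. The toy corpus and the similarity threshold are invented for illustration.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "spectral clustering graph laplacian eigenvectors",
    "graph laplacian spectral methods clustering",
    "k-means clustering centroids partition",
    "partition data with k-means centroids",
    "text mining bag of words documents",
    "documents words text corpus mining",
]

# Document similarity graph from tf-idf cosine similarities above a threshold.
tfidf = TfidfVectorizer().fit_transform(docs)
S = cosine_similarity(tfidf)
G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if S[i, j] > 0.2:
            G.add_edge(i, j, weight=S[i, j])

# Sequential assignment driven by centrality: the most central unassigned document
# seeds a cluster and pulls in its sufficiently similar unassigned neighbours.
centrality = nx.degree_centrality(G)
unassigned, clusters = set(G.nodes), []
while unassigned:
    seed = max(unassigned, key=centrality.get)
    members = {seed} | {v for v in G.neighbors(seed) if v in unassigned and S[seed, v] > 0.2}
    clusters.append(sorted(members))
    unassigned -= members
print(clusters)
```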
