Similar articles
 Found 20 similar articles (search time: 31 ms)
1.

Kaufman and Rousseeuw (1990) proposed the clustering algorithm Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common situation where many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM has problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the criterion "Average Silhouette" defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
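
The "Average Silhouette" criterion mentioned above can be computed directly from a distance matrix and a hard partition. The sketch below follows the standard Kaufman and Rousseeuw definition, s(i) = (b − a) / max(a, b), where a is the mean distance from i to its own cluster and b is the smallest mean distance from i to any other cluster. The function name `average_silhouette` and the convention of scoring singletons as 0 are illustrative assumptions; this is not the paper's algorithm or its fast approximation.

```python
def average_silhouette(dist, labels):
    """Mean silhouette width of a hard partition.

    dist   -- full symmetric distance matrix (list of lists)
    labels -- cluster label of each observation
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes s(i) = 0
        # a: mean distance to the other members of i's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)
```

Because the function only needs `dist`, any distance metric can be plugged in, which mirrors the property of PAM noted in the abstract.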

2.
We complement the work of Cerioli, Riani, Atkinson and Corbellini by discussing monitoring in the context of robust clustering. This implies extending the approach to clustering, and possibly monitoring more than one parameter simultaneously. The cases of trimming and snipping are discussed separately, and special attention is given to recently proposed methods like double clustering, reweighting in robust clustering, and fuzzy regression clustering.

3.
Many recent applications of nonparametric Bayesian inference use random partition models, i.e. probability models for clustering a set of experimental units. We review the popular basic constructions. We then focus on an interesting extension of such models. In many applications covariates are available that could be used to a priori inform the clustering. This leads to random clustering models indexed by covariates, i.e., regression models with the outcome being a partition of the experimental units. We discuss some alternative approaches that have been used in the recent literature to implement such models, with an emphasis on a recently proposed extension of product partition models. Several of the reviewed approaches were not originally intended as covariate-based random partition models, but can be used for such inference.

4.
Model-based clustering is a flexible grouping technique based on fitting finite mixture models to data groups. Despite its rapid development in recent years, there is rather limited literature devoted to developing diagnostic tools for the clustering solutions obtained. In this paper, a new method based on fuzzy variation decomposition is proposed for probabilistically assessing the contribution of variables to a detected dataset partition. Correlation between variable contributions reveals the underlying variable interaction structure. A visualization tool illustrates whether two variables work collaboratively or exclusively in the model. Elimination of negative-effect variables in the partition leads to better classification results. The developed technique is applied to real-life datasets with promising results.

5.
Partitioning objects into closely related groups that have different states makes it possible to understand the underlying structure of the data set under study. Different kinds of similarity measures are commonly used with clustering algorithms to find a clustering that is optimal or close to the original clustering. Using shrinkage-based and rank-based correlation coefficients, which are known to be robust, the recovery level of six chosen clustering algorithms is evaluated using Rand’s C values. The recovery levels using the weighted likelihood estimate of the correlation coefficient are obtained and compared with the results from using those correlation coefficients in agglomerative clustering algorithms. This work was supported by RIC(R) grants from Traditional and Bio-Medical Research Center, Daejeon University (RRC04713, 2005) by ITEP in Republic of Korea.

6.
Silhouette information evaluates the quality of the partition detected by a clustering technique. Since it is based on a measure of distance between the clustered observations, its standard formulation is not adequate when a density-based clustering technique is used. In this work we propose a suitable modification of the Silhouette information aimed at evaluating the quality of clusters in a density-based framework. It is based on the estimation of the data posterior probabilities of belonging to the clusters and may be used to measure our confidence about data allocation to the clusters as well as to choose the best partition among different ones.

7.
We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.

8.

The purpose of this paper is to show, in regression clustering, how to choose the most relevant solutions, analyze their stability, and provide information about the best combinations of the optimal number of groups, the restriction factor on the error variance across groups, and the level of trimming. The procedure is based on two steps. First, we generalize the information criteria of constrained robust multivariate clustering to the case of clustering weighted models. Unlike traditional approaches, which choose the best solution by minimizing an information criterion (e.g. BIC), we concentrate our attention on the so-called optimal stable solutions. In the second step, using the monitoring approach, we select the best value of the trimming factor. Finally, we validate the solution using a confirmatory forward search approach. A motivating example based on a novel dataset concerning the European Union trade of face masks shows the limitations of the existing procedures. The suggested approach is initially applied to a set of well-known datasets in the literature on robust regression clustering. Then, we focus our attention on a set of international trade datasets and we provide a novel informative way of updating the subset in the random start approach. The Supplementary material, in the spirit of the Special Issue, deepens the analysis of trade data and compares the suggested approach with the existing ones available in the literature.

9.
This study develops a robust automatic algorithm for clustering probability density functions based on previous research. Unlike other existing methods, which often pre-determine the number of clusters, this method can self-organize data groups based on the original data structure. The proposed clustering method is also robust to noise. Three examples of synthetic data and a real-world COREL dataset are used to illustrate the accuracy and effectiveness of the proposed approach.

10.
We propose a new robust regression estimator using a data partition technique and M estimation (DPM). The data partition technique is designed to define a small fixed number of subsets of the partitioned data set and to produce corresponding ordinary least squares (OLS) fits in each subset, in contrast to the resampling technique of existing robust estimators such as the least trimmed squares estimator. The proposed estimator shares a common strategy with the median ball algorithm estimator, which is obtained from OLS trial fits on only a fixed number of subsets of the data. We examine the performance of the DPM estimator on eleven challenging data sets and in simulation studies. We also compare the DPM with five commonly used robust estimators using empirical convergence rates relative to the OLS for clean data, robustness through mean squared error and bias, masking and swamping probabilities, the ability to detect known outliers, and regression and affine equivariance.

11.
Summary. We present a decision theoretic formulation of product partition models (PPMs) that allows a formal treatment of different decision problems such as estimation or hypothesis testing and clustering methods simultaneously. A key observation in our construction is the fact that PPMs can be formulated in the context of model selection. The underlying partition structure in these models is closely related to that arising in connection with Dirichlet processes. This allows a straightforward adaptation of some computational strategies, originally devised for nonparametric Bayesian problems, to our framework. The resulting algorithms are more flexible than other competing alternatives that are used for problems involving PPMs. We propose an algorithm that yields Bayes estimates of the quantities of interest and the groups of experimental units. We explore the application of our methods to the detection of outliers in normal and Student t regression models, with clustering structure equivalent to that induced by a Dirichlet process prior. We also discuss the sensitivity of the results considering different prior distributions for the partitions.

12.
A nonparametric test for the presence of clustering in survival data is proposed. Assuming a model that incorporates the clustering effect into the Cox Proportional Hazards model, simulation studies indicate that the procedure is correctly sized and powerful in a reasonably wide range of scenarios. The test for the presence of clustering over time is also robust to model misspecification. With a large number of clusters, the test is powerful even if the data are highly heterogeneous.

13.
Covariate-informed product partition models incorporate the intuitively appealing notion that individuals or units with similar covariate values a priori have a higher probability of co-clustering than those with dissimilar covariate values. These methods have been shown to perform well if the number of covariates is relatively small. However, as the number of covariates increases, their influence on partition probabilities overwhelms any information the response may provide for clustering and often encourages partitions with either a large number of singleton clusters or one large cluster, resulting in poor model fit and poor out-of-sample prediction. This same phenomenon is observed in Bayesian nonparametric regression methods that induce a conditional distribution for the response given covariates through a joint model. In light of this, we propose two methods that calibrate the covariate-dependent partition model by capping the influence that covariates have on partition probabilities. We demonstrate the new methods’ utility using simulation and two publicly available datasets.

14.
We introduce a robust clustering procedure for parsimonious model-based clustering. The classical mclust framework is robustified through impartial trimming and eigenvalue-ratio constraints (the tclust framework, which is robust but not affine invariant). An advantage of our resulting mtclust approach is that eigenvalue-ratio constraints are not needed for certain model formulations, leading to affine invariant robust parsimonious clustering. We illustrate the approach via simulations and a benchmark real data example. R code for the proposed method is available at https://github.com/afarcome/mtclust.

15.
Trimming principles play an important role in robust statistics. However, their use for clustering typically requires some preliminary information about the contamination rate and the number of groups. We suggest a fresh approach to trimming that does not rely on this knowledge and that proves to be particularly suited for solving problems in robust cluster analysis. Our approach replaces the original K‐population (robust) estimation problem with K distinct one‐population steps, which take advantage of the good breakdown properties of trimmed estimators when the trimming level exceeds the usual bound of 0.5. In this setting, we prove that exact affine equivariance is lost on one hand but, on the other hand, an arbitrarily high breakdown point can be achieved by “anchoring” the robust estimator. We also support the use of adaptive trimming schemes, in order to infer the contamination rate from the data. A further bonus of our methodology is its ability to provide a reliable choice of the usually unknown number of groups.

16.
One of the most popular algorithms for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on some basic conditions, such as the existence of the mean and a finite variance, it is unsuitable for data whose variance is infinite, such as data from a heavy-tailed distribution. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, using PMC and building on k-means clustering, a new distance and clustering algorithm is developed for heavy-tailed data.

17.
Mixture model-based clustering is widely used in many applications. In certain real-time applications the rapid increase of data size with time makes classical clustering algorithms too slow. An online clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers which uses data from acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known online k-means algorithm, able to handle non-spherical clusters. Using synthetic and real data sets, the proposed algorithm is compared with the batch CEM algorithm and the online EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when clusters are relatively well separated, but online algorithms become faster as the size of the available observations increases.
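
For orientation, the online k-means algorithm that the abstract above generalizes can be sketched as follows: each incoming observation is assigned to its nearest center, and that center is moved toward the observation with a step size of 1/count. This is the textbook sequential (MacQueen-style) update with squared Euclidean distance, not the online CEM algorithm of the paper; the function name `online_kmeans` and the choice of step size are assumptions made for illustration.

```python
def online_kmeans(stream, init_centers):
    """Sequential k-means: process observations one at a time.

    stream       -- iterable of points (sequences of floats)
    init_centers -- starting centers, used only until a point is assigned
    """
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for x in stream:
        # Assign the point to the nearest center (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j])),
        )
        counts[k] += 1
        rate = 1.0 / counts[k]  # decreasing step size: center = running mean
        centers[k] = [ci + rate * (xi - ci) for xi, ci in zip(x, centers[k])]
    return centers, counts
```

Each update costs O(k) distance evaluations and never revisits earlier observations, which is why online variants scale well as data accumulate.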

18.
A cluster methodology motivated by a robust similarity matrix is proposed for identifying likely multivariate outlier structure and for estimating weighted least-squares (WLS) regression parameters in linear models. The proposed method is an agglomeration of procedures that begins with clustering the n observations, proceeds through a test of the ‘no-outlier hypothesis’ (TONH), and ends with a weighted least-squares regression estimation. The cluster phase partitions the n observations into an h-set, called the main cluster, and a minor cluster of size n − h. A robust distance emerges from the main cluster, upon which a test of the ‘no-outlier hypothesis’ is conducted. An initial WLS regression estimate is computed from the robust distance obtained from the main cluster. Until convergence, a re-weighted least-squares (RLS) regression estimate is updated with weights based on the normalized residuals. The proposed procedure blends an agglomerative hierarchical cluster analysis with complete linkage, through the TONH, into the re-weighted regression estimation phase. Hence, we propose to call it cluster-based re-weighted regression (CBRR). The CBRR is compared with three existing procedures using two data sets known to exhibit masking and swamping. The performance of CBRR is further examined through a simulation experiment. The results obtained from the data set illustrations and the Monte Carlo study show that the CBRR is effective in detecting multivariate outliers in cases where other methods are susceptible to masking and swamping. The CBRR does not require enormous computation and is substantially less susceptible to masking and swamping.

19.
In this paper, we present a new algorithm for clustering a proximity-relation matrix that does not require the transitivity property. The proposed algorithm is first inspired by the idea of Yang and Wu [16] and then turned into a self-organizing process that is built upon the intuition behind clustering. At the end of the process, subjects belonging to the same cluster should converge to the same point, which represents the cluster center. However, the performance of Yang and Wu's algorithm depends on parameter selection. In this paper, we use the partition entropy (PE) index to choose the parameter. Numerical results illustrate that the proposed method not only solves the parameter selection problem but also obtains an optimal clustering result. Finally, we apply the proposed algorithm to three applications. One is to evaluate the performance of higher education in Taiwan, another is machine–parts grouping in cellular manufacturing systems, and the third is to cluster probability density functions.
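
The partition entropy (PE) index named above is commonly defined, following Bezdek, as PE = −(1/n) Σ_i Σ_k u_ik ln u_ik, where u_ik is the membership of observation i in cluster k. A minimal sketch is given below; the function name `partition_entropy` and the input layout (one row of memberships per observation) are assumptions, and the abstract does not say exactly how the index is used to pick the parameter.

```python
import math


def partition_entropy(memberships):
    """Bezdek's partition entropy of a membership matrix.

    memberships -- list of rows; row i holds the memberships of
                   observation i, assumed non-negative and summing to 1
    """
    n = len(memberships)
    # Terms with u = 0 are skipped, using the convention 0 * ln(0) = 0.
    return -sum(
        u * math.log(u) for row in memberships for u in row if u > 0
    ) / n
```

The index is 0 for a crisp partition and reaches its maximum, ln(c) for c clusters, when every membership equals 1/c, so lower values indicate a less ambiguous partition.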

20.
The problem of finding the most robust γ-level credible region for the parameter of interest in the presence of a nuisance parameter, with respect to a class of ε-contaminated priors, is studied. The case of arbitrary contaminations is first analyzed; it is proved that the most robust region for the parameter of interest is the γ-level highest marginal likelihood region (for γ ≥ 0.5). Then, the result is extended to any measurable (not necessarily one-to-one) function of the parameter. Finally, the case of contaminations assigning fixed probabilities to the sets of a partition of the parameter space is analyzed and a partial result is given.
