Impact of Contamination on Training and Test Error Rates in Statistical Clustering
Authors:C. Ruwet  G. Haesbroeck
Affiliation:Department of Mathematics, University of Liège, Liège, Belgium; cruwet@ulg.ac.be
Abstract:The k-means algorithm is one of the most common nonhierarchical clustering methods. It constructs clusters so as to minimize the within-cluster sum of squared distances. However, like most estimators defined through objective functions that depend on global sums of squares, the k-means procedure is not robust to atypical observations in the data. Alternative techniques have therefore been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. This article focuses on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are highlighted by comparing the influence functions and breakdown points of these error rates.
Keywords:Clustering analysis  Error rate  Generalized k-means  Influence function  Principal points  Robustness
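
The sensitivity of a sum-of-squares criterion to atypical observations, as described in the abstract, can be illustrated with a minimal sketch. It assumes a one-dimensional setting and a generalized criterion of the form "average penalty of each point's distance to its nearest center". The function name `generalized_objective`, the penalty functions `squared` and `absolute`, and the toy data are all hypothetical and are not taken from the article. The squared penalty corresponds to the k-means criterion; the absolute penalty is used here only as a stand-in for a less outlier-sensitive alternative, which is an assumption rather than the article's stated definition of k-medoids.

```python
from typing import Callable, Sequence


def generalized_objective(
    points: Sequence[float],
    centers: Sequence[float],
    omega: Callable[[float], float],
) -> float:
    """Average penalty omega(d), where d is the distance from each point
    to its nearest center (illustrative 1-D criterion, an assumption)."""
    total = 0.0
    for x in points:
        nearest = min(abs(x - c) for c in centers)
        total += omega(nearest)
    return total / len(points)


def squared(d: float) -> float:
    # Penalty matching the within-cluster sum of squared distances (k-means).
    return d * d


def absolute(d: float) -> float:
    # Hypothetical alternative penalty that grows only linearly with distance.
    return d


# Toy data: two tight groups around -1 and 1, then one added outlier.
clean = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
contaminated = clean + [10.0]
centers = [-1.0, 1.0]

# Relative inflation of each criterion caused by the single outlier,
# with the centers held fixed.
ratio_squared = (
    generalized_objective(contaminated, centers, squared)
    / generalized_objective(clean, centers, squared)
)
ratio_absolute = (
    generalized_objective(contaminated, centers, absolute)
    / generalized_objective(clean, centers, absolute)
)
print(ratio_squared, ratio_absolute)
```

With the centers held fixed, the single outlier inflates the squared criterion far more than the absolute one. The sketch only evaluates the criteria; it does not estimate centers or compute the error rates studied in the article.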