Impact of Contamination on Training and Test Error Rates in Statistical Clustering |
| |
Authors: | C. Ruwet, G. Haesbroeck |
| |
Affiliation: | 1. Department of Mathematics, University of Liège, Liège, Belgium, cruwet@ulg.ac.be; 3. Department of Mathematics, University of Liège, Liège, Belgium |
| |
Abstract: | The k-means algorithm is one of the most common nonhierarchical clustering methods. It constructs clusters so as to minimize the within-cluster sum of squared distances. However, like most estimators defined through objective functions that depend on global sums of squares, the k-means procedure is not robust to atypical observations in the data. Alternative techniques have therefore been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. This article focuses on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are highlighted by comparing the influence functions and breakdown points of these error rates. |
| |
Keywords: | Clustering analysis; Error rate; Generalized k-means; Influence function; Principal points; Robustness |
|
|