Distance-based outlier detection for high dimension, low sample size data |
| |
Authors: | Jeongyoun Ahn, Myung Hee Lee, Jung Ae Lee |
| |
Affiliation: | 1. Department of Statistics, University of Georgia, Athens, GA, USA; 2. Center for Global Health, Department of Medicine, Weill Cornell Medical College, New York City, NY, USA; 3. Agricultural Statistics Laboratory, University of Arkansas, Fayetteville, AR, USA |
| |
Abstract: | Despite the popularity of high dimension, low sample size data analysis, the issue of sample integrity, in particular the possibility of outliers in the data, has not received enough attention. A new outlier detection procedure for data with much larger dimensionality than the sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented, along with a discussion of the implications of the detected outliers. |
| |
Keywords: | Centroid distance; HDLSS; high-dimensional asymptotics; maximal data piling distance; multiple outliers |