Similar literature
1.
This article describes an algorithm for the identification of outliers in multivariate data based on the asymptotic theory for location estimation as described typically for the trimmed likelihood estimator and in particular for the minimum covariance determinant estimator. The strategy is to choose a subset of the data which minimizes an appropriate measure of the asymptotic variance of the multivariate location estimator. Observations not belonging to this subset are considered potential outliers which should be trimmed. For α less than about 0.5, the correct trimming proportion is taken to be that α > 0 for which the minimum of any minima of this measure of the asymptotic variance occurs. If no minima occur for an α > 0 then the data set will be considered outlier free.

2.
Despite the popularity of high-dimension, low-sample-size data analysis, the issue of sample integrity, in particular the possibility of outliers in the data, has not received enough attention. A new outlier detection procedure for data with much larger dimensionality than the sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented, with consideration of the implications of the outliers found.

3.
The detection of outliers is an interesting and important aspect of data analysis, as outliers can affect inference. Various methods are available in the literature for detecting outliers in multivariate data [V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, 1994] using the Mahalanobis distance measure. An attempt is made to propose an alternative method of outlier detection based on the comedian introduced by Falk [On MAD and Comedians, Ann. Inst. Statist. Math. 49 (1997), pp. 615–644]. The proposed method is computationally efficient, with a high breakdown value and low computation time. Further, important properties, namely success rates (SR) and false detection rates (FDR), are studied and compared with those of some well-known outlier detection methods through a simulation study. The Comedian method has high SR and low FDR for all combinations of parameters. On removing or down-weighting the detected outliers, highly robust and approximately affine equivariant estimators of multivariate location and scatter can be obtained. Finally, the method is applied to well-known real data sets to evaluate its performance.
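As a rough illustration of the quantity this abstract builds on, the comedian of two variables is the median of the products of their median-centred values, COM(X, Y) = med{(X − med X)(Y − med Y)}, so that COM(X, X) is the squared (unscaled) MAD. The sketch below computes only that quantity and a pairwise comedian matrix; it is not the outlier detection procedure of the paper, and the function names `comedian`, `mad` and `comedian_matrix` are hypothetical.

```python
import numpy as np


def comedian(x, y):
    """Comedian of two samples: med((x - med x) * (y - med y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.median((x - np.median(x)) * (y - np.median(y))))


def mad(x):
    """Unscaled median absolute deviation: med(|x - med x|)."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))


def comedian_matrix(data):
    """Pairwise comedian matrix for an (n, p) data array (illustrative helper)."""
    data = np.asarray(data, dtype=float)
    centered = data - np.median(data, axis=0)
    p = data.shape[1]
    com = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            # Entry (j, k) is the median of products of centred columns j and k.
            com[j, k] = np.median(centered[:, j] * centered[:, k])
    return com
```

With an odd sample size the identity COM(X, X) = MAD(X)² holds exactly in this implementation; with an even sample size NumPy's median averages the two middle values, so the two sides can differ slightly.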

4.
Summary.  We use the forward search to provide robust Mahalanobis distances to detect the presence of outliers in a sample of multivariate normal data. Theoretical results on order statistics and on estimation in truncated samples provide the distribution of our test statistic. We also introduce several new robust distances with associated distributional results. Comparisons of our procedure with tests using other robust Mahalanobis distances show the good size and high power of our procedure. We also provide a unification of results on correction factors for estimation from truncated samples.

5.

A basic graphical approach for checking normality is the Q-Q plot, which compares sample quantiles against the population quantiles. In univariate analysis, the probability plot correlation coefficient test for normality has been studied extensively. We consider testing multivariate normality by using the correlation coefficient of the Q-Q plot. When multivariate normality holds, the sample squared distance should follow a chi-square distribution for large samples, and the plot should resemble a straight line. A correlation coefficient test can be constructed by using the pairs of points in the probability plot. When the correlation coefficient test does not reject the null hypothesis, the sample data may come from a multivariate normal distribution or from some other distribution. We therefore use the following two steps to test multivariate normality. First, we check multivariate normality by using the probability plot correlation coefficient test. If the test does not reject the null hypothesis, we then test the symmetry of the distribution and determine whether multivariate normality holds. This test procedure is called the combination test. The size and power of this test are studied, and it is found that the combination test is, in general, more powerful than other tests for multivariate normality.
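The first step described above, the correlation between ordered squared distances and chi-square quantiles, can be sketched as follows. This is a minimal illustration rather than the authors' test: it is restricted to bivariate data (an assumption made here so that the chi-square quantile with 2 degrees of freedom has the closed form −2 ln(1 − p) and no statistics library is needed), the plotting positions (i − 0.5)/n are one common choice and not taken from the abstract, and the function name is hypothetical. Critical values for the correlation coefficient are not included.

```python
import numpy as np


def qq_correlation_bivariate(data):
    """Correlation between sorted squared Mahalanobis distances and
    chi-square(2) quantiles for an (n, 2) data array."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    if p != 2:
        raise ValueError("this sketch uses closed-form quantiles valid only for p == 2")

    # Squared Mahalanobis distances from the sample mean and covariance.
    centered = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)

    # Plotting positions (i - 0.5) / n; an assumed, commonly used convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    # Chi-square with 2 degrees of freedom: quantile(p) = -2 * ln(1 - p).
    quantiles = -2.0 * np.log1p(-probs)

    return float(np.corrcoef(quantiles, np.sort(d2))[0, 1])
```

For data that really are bivariate normal the returned coefficient should be close to 1; for general dimension p the closed-form quantile would have to be replaced by a chi-square quantile function with p degrees of freedom.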

6.
We propose a new regression-based filter for extracting signals online from multivariate high frequency time series. It separates relevant signals of several variables from noise and (multivariate) outliers.

Unlike parallel univariate filters, the new procedure takes into account the local covariance structure between the single time series components. It is based on high-breakdown estimates, which makes it robust against (patches of) outliers in one or several of the components as well as against outliers with respect to the multivariate covariance structure. Moreover, the trade-off problem between bias and variance for the optimal choice of the window width is approached by choosing the size of the window adaptively, depending on the current data situation.

Furthermore, we present an advanced algorithm of our filtering procedure that includes the replacement of missing observations in real time. Thus, the new procedure can be applied in online-monitoring practice. Applications to physiological time series from intensive care show the practical effect of the proposed filtering technique.

7.
The investigation of outlier identification in linear regression models can be extended to the circular regression case. In this paper, we propose a new numerical statistic called the mean circular error to identify possible outliers in circular regression models by using a row deletion approach. Through intensive simulation studies, the cut-off points of the statistic are obtained and its power of performance is investigated. It is found that the performance improves as the concentration parameter of circular residuals becomes larger or the sample size becomes smaller. As an illustration, the statistic is applied to a wind direction data set.

8.
Data on the weights and heights of children 2-18 years old in Iran were obtained in a National Health Survey of 10 660 families in 1990-92. Data were 'cleaned' in 1 year age groups. After excluding gross outliers by inspection of bivariate scatter plots, Box-Cox power transformations were used to normalize the distributions of height and weight. If a multivariate Box-Cox power transformation to normality exists, then it is equivalent to normalizing the data variable by variable. After excluding gross outliers, exclusions based on the Mahalanobis distance were almost identical to those identified by Hadi's iterative procedure, because the percentages of outliers were small. In all, 1% of the observations were gross outliers and a further 0.4% were identified by multivariate analysis. Review of records showed that the outliers identified by multivariate analysis resulted from data-processing errors. After transformation and 'cleaning', the data quality was excellent and suitable for the construction of growth charts.
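The two ingredients named in this abstract, a Box-Cox power transformation applied variable by variable and a Mahalanobis distance screen on the bivariate result, can be sketched roughly as below. This is an illustration under stated assumptions, not the survey's actual cleaning procedure: the power parameters are supplied by the caller rather than estimated, the cut-off level of 0.999 is an arbitrary placeholder (the abstract gives no cut-off), and the function names are hypothetical.

```python
import numpy as np


def box_cox(x, lam):
    """Box-Cox power transformation: (x**lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam


def mahalanobis_flags(data, level=0.999):
    """Flag rows of an (n, 2) array whose squared Mahalanobis distance exceeds
    the chi-square(2) quantile at `level` (an assumed, illustrative cut-off)."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2 or data.shape[1] != 2:
        raise ValueError("this sketch handles bivariate data only")
    centered = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    # Chi-square with 2 degrees of freedom: quantile(p) = -2 * ln(1 - p).
    cutoff = -2.0 * np.log(1.0 - level)
    return d2 > cutoff
```

A typical use would transform each column with its own parameter, for example `np.column_stack([box_cox(height, lam_h), box_cox(weight, lam_w)])`, and then pass the result to `mahalanobis_flags`; the classical mean and covariance used here are themselves sensitive to outliers, which is why the abstract removes gross outliers first.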

9.
Many methods have been developed for detecting multiple outliers in a single multivariate sample, but very few for the case where there may be groups in the data. We propose a method of simultaneously determining groups (as in cluster analysis) and detecting outliers, which are points that are distant from every group. Our method is an adaptation of the BACON algorithm proposed by Billor, Hadi and Velleman for the robust detection of multiple outliers in a single group of multivariate data. There are two versions of our method, depending on whether or not the groups can be assumed to have equal covariance matrices. The effectiveness of the method is illustrated by its application to two real data sets and further shown by a simulation study for different sample sizes and dimensions for 2 and 3 groups, with and without planted outliers in the data. When the number of groups is not known in advance, the algorithm could be used as a robust method of cluster analysis, by running it for various numbers of groups and choosing the best solution.

10.
The presence of outliers inevitably leads to distorted analysis and inappropriate prediction, especially for multiple outliers in high-dimensional regression, where the high dimensionality of the data might amplify the chance of one or more observations being outlying. Noting that the detection of outliers is both necessary and important in high-dimensional regression analysis, we propose in this paper a feasible outlier detection approach for sparse high-dimensional linear regression models. First, we search for a clean subset using the sure independence screening method and the least trimmed squares regression estimates. Then, we define a high-dimensional outlier detection measure and propose a multiple outlier detection approach through multiple testing procedures. In addition, to enhance efficiency, we refine the outlier detection rule after obtaining a relatively reliable non-outlier subset based on the initial detection approach. Comparison studies based on Monte Carlo simulation show that the proposed method performs well in detecting multiple outliers in sparse high-dimensional linear regression models. We further illustrate the application of the proposed method by an empirical analysis of real-life protein and gene expression data.

11.
The Forward Search is a powerful general method, incorporating flexible data-driven trimming, for the detection of outliers and unsuspected structure in data and so for building robust models. Starting from small subsets of data, observations that are close to the fitted model are added to the observations used in parameter estimation. As this subset grows we monitor parameter estimates, test statistics and measures of fit such as residuals. The paper surveys theoretical development in work on the Forward Search over the last decade. The main illustration is a regression example with 330 observations and 9 potential explanatory variables. Mention is also made of procedures for multivariate data, including clustering, time series analysis and fraud detection.

12.
Statistical methods are effectively used in the evaluation of pharmaceutical formulations instead of laborious liquid chromatography. However, signal overlapping, nonlinearity, multicollinearity and the presence of outliers degrade the performance of statistical methods. Partial Least Squares Regression (PLSR) is a very popular method for the quantification of high-dimensional, spectrally overlapped drug formulations. SIMPLS is the most widely used PLSR algorithm, but it is highly sensitive to outliers, which also affect the diagnostics. In this paper, we propose new robust multivariate diagnostics to identify outliers, influential observations and points causing non-normality for a PLSR model. We study the performance of the proposed diagnostics on two highly overlapping drug systems in everyday use: Paracetamol–Caffeine and Doxylamine Succinate–Pyridoxine Hydrochloride.

13.
A general way of detecting multivariate outliers involves using robust depth functions, or, equivalently, the corresponding ‘outlyingness’ functions; the more outlying an observation, the more extreme (less deep) it is in the data cloud and thus potentially an outlier. Most outlier detection studies in the literature assume that the underlying distribution is multivariate normal. This paper deals with the case of multivariate skewed data, specifically when the data follow the multivariate skew-normal [1] distribution. We compare the outlier detection capabilities of four robust outlier detection methods through their outlyingness functions in a simulation study. Two scenarios are considered for the occurrence of outliers: ‘the cluster’ and ‘the radial’. Conclusions and recommendations are offered for each scenario.

14.
Detecting outliers in a multivariate point cloud is not trivial, especially when dealing with a sizable fraction of contamination. Over time, it has increasingly been recognized that the safest and most feasible approach to exposing outliers starts by computing a highly robust estimator of location and scatter that can withstand a large proportion of contamination. Many such estimators have been proposed in recent years. We will compare the worst-case bias of several prominent robust multivariate estimators by means of simulation. We also propose a new tool to compare robust estimators on real data sets, and illustrate it.

15.
The local influence method introduced by Cook is adapted to multivariate normal data for the purpose of detecting outliers. The method allows simultaneous perturbations on all observations, so that it can identify multiple outliers. An illustrative example is given to show the effectiveness of the method for the identification of influential observations.

16.
Robust statistics have slowly become familiar to all practitioners. Books entirely devoted to the subject (e.g. [R.A. Maronna, R.D. Martin, V.J. Yohai, Robust Statistics: Theory and Methods. John Wiley & Sons, New York, NY, USA, 2006; P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, New York, NY, USA, 1987], …) are without any doubt responsible for the increased practice of robust statistics in all fields of application. Even classical books often have at least one chapter (or parts of chapters) that develops robust methodology. The improvement of computing power has also contributed to the development of a wider and wider range of available robust procedures. However, this success story now threatens to backfire: non-specialists interested in applying robust methodology are faced with a large set of (assumed equivalent) methods and with the over-sophistication of some of them. Which method should one use? How should the (numerous) parameters be optimally tuned? These questions are not so easy to answer for non-specialists! One could then argue that default procedures are available in most statistical software (Splus, R, SAS, Matlab, …). However, using the detection of outliers in multivariate data as an illustration, it is shown that, on the one hand, it is not obvious that one would feel confident with the output of default procedures, and that, on the other hand, trying to understand thoroughly the tuning parameters involved in the procedures might require extensive research. This is not conceivable when trying to compete with the classical methodology, which (while clearly unreliable) is so straightforward. The aim of the paper is to help practitioners who wish to detect outliers reliably in a multivariate data set. The chosen methodology is the Minimum Covariance Determinant estimator, which is widely available and intuitively appealing.

17.
The study of multivariate outliers raises many problems of definition, principle and manipulation. Well-authenticated tests of discordancy exist only for the multivariate normal distribution. Detection of outliers in non-normal distributions involves the adoption of appropriate criteria to represent 'extremeness' of observations in a sample; corresponding tests of discordancy usually require tedious, or even intractable, distributional and computational manipulations. A class of transformations of the data is considered with a view of transferring some of the familiar and desirable features of discordancy tests for normal samples to non-normal situations.

18.
Mixtures of multivariate t distributions provide a robust parametric extension to the fitting of data with respect to normal mixtures. In the presence of some noise component, potential outliers or data with longer-than-normal tails, one way to broaden the model is to consider t distributions. In this framework, the degrees of freedom can act as a robustness parameter, tuning the heaviness of the tails and downweighting the effect of the outliers on parameter estimation. The aim of this paper is to extend to mixtures of multivariate elliptical distributions some theoretical results about likelihood maximization on constrained parameter spaces. Further, a constrained monotone algorithm implementing maximum likelihood mixture decomposition of multivariate t distributions is proposed, to achieve improved convergence capabilities and robustness. Monte Carlo numerical simulations and a real data study illustrate the better performance of the algorithm in comparison with earlier proposals.

19.
In fitting a regression model, one or more observations may have substantial effects on the estimators. These unusual observations are precisely detected by a new diagnostic measure, Pena's statistic. In this article, we introduce a type of Pena's statistic for each point in Liu regression. Using the forecast change property, we simplify Pena's statistic in a numerical sense. It is found that the simplified Pena's statistic behaves quite well as far as the detection of influential observations is concerned. We express Pena's statistic in terms of the Liu leverages and residuals. The normality of this statistic is also discussed, and it is demonstrated that it can identify a subset of high Liu leverage outliers. For numerical evaluation, simulation studies are given and a real data set is analysed for illustration.

20.