Similar Documents
20 similar documents found (search time: 78 ms)
1.
Two decompositions of the Mahalanobis distance are considered. These decompositions help to explain some reasons for the outlyingness of multivariate observations. They also provide a graphical tool for identifying outliers, including those that have a large influence on the multiple correlation coefficient. Illustrative examples are given.
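The quantity underlying this abstract, the (squared) Mahalanobis distance, can be sketched in a few lines of NumPy. The paper's specific decompositions are not reproduced here; this only shows the classical distance used to flag outliers, with the sample size, seed, and planted outlier as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[0] = [6.0, 6.0, 6.0]  # planted outlier (illustrative)

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance of every observation to the centre
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

print(int(np.argmax(d2)))  # → 0, the planted point is the most outlying
```

Decomposing `d2` into per-variable contributions is what lets one explain *why* a case is flagged, which is the topic of the paper.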

2.
A general way of detecting multivariate outliers involves using robust depth functions, or, equivalently, the corresponding ‘outlyingness’ functions; the more outlying an observation, the more extreme (less deep) it is in the data cloud and thus potentially an outlier. Most outlier detection studies in the literature assume that the underlying distribution is multivariate normal. This paper deals with the case of multivariate skewed data, specifically when the data follow the multivariate skew-normal [1] distribution. We compare the outlier detection capabilities of four robust outlier detection methods through their outlyingness functions in a simulation study. Two scenarios are considered for the occurrence of outliers: ‘the cluster’ and ‘the radial’. Conclusions and recommendations are offered for each scenario.
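The four methods compared in the paper are not named in this abstract; as a generic illustration of an outlyingness function, here is a minimal sketch of Stahel–Donoho-type projection outlyingness approximated with random directions (the exact definition takes a supremum over all unit directions; the data and seed are assumptions):

```python
import numpy as np

def projection_outlyingness(X, n_dirs=500, seed=1):
    """For each point, the largest robust standardized deviation
    |u'x - med(u'X)| / MAD(u'X) over random unit directions u."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    P = X @ U.T                                   # (n, n_dirs) projections
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    return (np.abs(P - med) / mad).max(axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X[0] = [8.0, -8.0]                                # planted outlier
out = projection_outlyingness(X)
print(int(np.argmax(out)))  # → 0
```

Because medians and MADs are used instead of means and standard deviations, a single extreme point cannot mask itself by inflating the scale estimate.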

3.

Outlier detection is an inevitable step in most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high dimensional data, the outlier will most likely not be outlying along all of its variables, but just along a subset of them. If so, the scientific question why the case has been flagged as an outlier becomes of interest. In this article, a fast and efficient method is proposed to detect variables that contribute most to an outlier’s outlyingness. Thereby, it helps the analyst understand in which way an outlier lies out. The approach pursued in this work is to estimate the univariate direction of maximal outlyingness. It is shown that the problem of estimating that direction can be rewritten as the normed solution of a classical least squares regression problem. Identifying the subset of variables contributing most to outlyingness can thus be achieved by estimating the associated least squares problem in a sparse manner. From a practical perspective, sparse partial least squares (SPLS) regression, preferably by the fast sparse NIPALS (SNIPLS) algorithm, is suggested to tackle that problem. The proposed method is demonstrated to perform well both on simulated data and on real-life examples.

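For the non-sparse case, the direction of maximal Mahalanobis outlyingness has a closed form, proportional to S⁻¹(x − mean); the article's contribution is a sparse estimate of this direction via SPLS/SNIPLS, which is not reproduced here. A sketch of the classical version (data and seed are assumptions):

```python
import numpy as np

def outlyingness_direction(x, X):
    """Direction maximizing |u'(x - mean)| / sqrt(u'Su), which by
    Cauchy-Schwarz is proportional to S^{-1}(x - mean). Componentwise
    products with (x - mean) indicate each variable's contribution."""
    mu = X.mean(axis=0)
    u = np.linalg.solve(np.cov(X, rowvar=False), x - mu)
    return u / np.linalg.norm(u)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
x = np.array([6.0, 0.0, 0.0, 0.0])   # outlying along the first variable only
u = outlyingness_direction(x, X)
print(int(np.argmax(np.abs(u))))     # → 0: variable 0 drives the outlyingness
```

The sparse variant additionally shrinks the small components of this direction to exactly zero, so that only the truly contributing variables remain.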

4.
Sample quantile, rank, and outlyingness functions play long-established roles in univariate exploratory data analysis. In recent years, various multivariate generalizations have been formulated, among which the “spatial” approach has become especially well developed, including fully affine equivariant/invariant versions with but modest computational burden (24, 6, 34, 32 and 25). The only shortcoming of the spatial approach is that its robustness decreases to zero as the quantile or outlyingness level is chosen farther out from the center (Dang and Serfling, 2010). This is especially detrimental to exploratory data analysis procedures such as detection of outliers and delineation of the “middle” 50%, 75%, or 90% of the data set, for example. Here we develop suitably robust versions using a trimming approach. The improvements in robustness are illustrated and characterized using simulated and actual data. Also, as a byproduct of the investigation, a new robust, affine equivariant, and computationally easy scatter estimator is introduced.
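The (untrimmed, non-robust) spatial outlyingness function mentioned here is simple to compute: it is the norm of the average unit vector pointing from the observations toward the query point, roughly 0 at the spatial median and approaching 1 far outside the cloud. A minimal sketch (data and thresholds are assumptions; the paper's trimmed, robustified version is not reproduced):

```python
import numpy as np

def spatial_outlyingness(x, X):
    """Norm of the mean unit vector from each observation toward x."""
    diff = x - X
    norms = np.linalg.norm(diff, axis=1)
    units = diff[norms > 0] / norms[norms > 0, None]
    return np.linalg.norm(units.mean(axis=0))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
far = spatial_outlyingness(np.array([10.0, 10.0]), X)   # near 1
centre = spatial_outlyingness(X.mean(axis=0), X)        # near 0
print(far > 0.9, centre < 0.2)
```

The robustness problem the paper addresses is visible in this formula: every observation, however extreme, contributes a full unit vector to the average.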

5.
Several nonparametric tests for the multivariate multi-sample location problem are proposed in this paper. These tests are based on the notion of data depth, which is used to measure the centrality/outlyingness of a given point with respect to a given distribution or a data cloud. The proposed tests are completely nonparametric and implemented through the idea of permutation tests. Performance of the proposed tests is compared with an existing parametric test and a nonparametric test based on data depth. An extensive simulation study reveals that the proposed tests are superior to the existing tests based on data depth with regard to power. Illustrations with real data are provided.
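The depth-plus-permutation recipe common to this line of work can be sketched for the two-sample case. Mahalanobis depth is used here for brevity (the paper's choice of depth function and test statistic may differ); sample sizes, the shift, and seeds are illustrative assumptions:

```python
import numpy as np

def depth_wrt(Z, ref):
    """Mahalanobis depth of each row of Z w.r.t. the cloud `ref`:
    1 / (1 + squared Mahalanobis distance); deeper = more central."""
    mu = ref.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(ref, rowvar=False))
    diff = Z - mu
    return 1.0 / (1.0 + np.einsum('ij,jk,ik->i', diff, Sinv, diff))

def depth_permutation_test(X, Y, n_perm=199, seed=5):
    """Location test: points of a shifted sample are systematically
    less deep in the other sample's cloud; p-value by permuting labels."""
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    def stat(order):
        Zp = Z[order]
        d = depth_wrt(Zp, Zp[:n])
        return d[:n].mean() - d[n:].mean()
    obs = stat(np.arange(len(Z)))
    hits = sum(stat(rng.permutation(len(Z))) >= obs for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 2))
Y = rng.normal(size=(30, 2)) + 2.0   # clear location shift
p = depth_permutation_test(X, Y)
print(p < 0.05)
```

Because the null distribution is built by permutation, the test needs no parametric assumption on the underlying distributions, which is the point made in the abstract.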

6.
The present paper deals with the problem of testing equality of locations of two multivariate distributions using a notion of data depth. A notion of data depth is used to measure the centrality/outlyingness of a given point in a given data cloud. The paper proposes two nonparametric tests for testing equality of locations of two multivariate populations, which are developed by observing the behavior of the depth versus depth plot. A simulation study reveals that the proposed tests are superior to the existing tests based on data depth with regard to power. Illustrations with real data are provided.

7.
A notion of data depth is used to measure the centrality or outlyingness of a data point in a given data cloud. In the context of data depth, the point (or points) having maximum depth is called the deepest point (or points). In the present work, we propose three multi-sample tests for testing equality of location parameters of multivariate populations by using the deepest point (or points). These tests can be considered as extensions of two-sample tests based on the deepest point (or points). The proposed tests are implemented through the idea of Fisher's permutation test. Performance of the proposed tests is studied by simulation. Illustration with two real datasets is also provided.

8.
Density-based clustering methods hinge on the idea of associating groups with the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods claim some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of a spatial tessellation such as the Delaunay triangulation has been proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of the data, thus making the triangulation unfeasible in high dimensions. Our aim is to overcome the limitations of the Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated with the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space of arbitrary dimension to a univariate one, thus yielding benefits in both computation and visualization.
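The segment-based idea can be sketched directly: estimate the density along the line joining two observations and declare them connected when no deep valley appears. The kernel estimator, bandwidth, valley threshold, and the two-cluster data below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def kde(points, X, bw=0.5):
    """Gaussian kernel density estimate at `points`, based on sample X."""
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bw ** 2)).mean(axis=1)

def same_component(x, y, X, n_grid=25, bw=0.5, frac=0.5):
    """Connect x and y if the estimated density along the segment x->y
    never drops below `frac` times the smaller endpoint density,
    i.e. the univariate density profile shows no deep valley."""
    t = np.linspace(0.0, 1.0, n_grid)[:, None]
    seg = (1 - t) * x + t * y
    f = kde(seg, X, bw)
    return bool(f.min() >= frac * min(f[0], f[-1]))

rng = np.random.default_rng(7)
A = rng.normal(scale=0.5, size=(40, 2))
B = rng.normal(scale=0.5, size=(40, 2)) + np.array([6.0, 6.0])
X = np.vstack([A, B])
in_same = same_component(np.zeros(2), np.array([0.8, 0.0]), X)   # True
across = same_component(np.zeros(2), np.array([6.0, 6.0]), X)    # False
print(in_same, across)
```

Whatever the dimension of the data, the valley check is always one-dimensional, which is the computational benefit claimed in the abstract.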

9.
Multivariate outlier detection requires computation of robust distances to be compared with appropriate cut-off points. In this paper we propose a new calibration method for obtaining reliable cut-off points of distances derived from the MCD estimator of scatter. These cut-off points are based on a more accurate estimate of the extreme tail of the distribution of robust distances. We show that our procedure gives reliable tests of outlyingness in almost all situations of practical interest, provided that the sample size is not much smaller than 50. Therefore, it is a considerable improvement over all the available MCD procedures, which are unable to provide good control over the size of multiple outlier tests for the data structures considered in this paper.
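The robust distances being calibrated come from the MCD (Minimum Covariance Determinant) estimator. A deliberately crude sketch of the MCD idea follows: among random subsets of size h, keep the one whose covariance determinant is smallest, then measure all points against that subset's mean and scatter. The real estimator refines subsets with C-steps and rescales, and the paper's contribution is the cut-off calibration, neither of which is shown; data and seeds are assumptions:

```python
import numpy as np

def mcd_robust_distances(X, n_trials=200, seed=8):
    """Crude MCD sketch: best (smallest-determinant) random h-subset,
    then squared robust distances of all points w.r.t. that subset."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2
    best_det, best_idx = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=h, replace=False)
        det = np.linalg.det(np.cov(X[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    mu = X[best_idx].mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X[best_idx], rowvar=False))
    diff = X - mu
    return np.einsum('ij,jk,ik->i', diff, Sinv, diff)

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 2))
X[:3] = rng.normal(scale=0.3, size=(3, 2)) + np.array([8.0, 8.0])  # outliers
d2 = mcd_robust_distances(X)
print(bool(d2[:3].min() > d2[3:].max()))   # outliers clearly separated
```

The open question the paper answers is which cut-off to compare `d2` against: naive chi-square quantiles are known to flag too many clean points in small samples.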

10.
Many spatial data, such as those in climatology or environmental monitoring, are collected over irregular geographical locations. Furthermore, it is common to have multivariate observations at each location. We propose a method of segmentation of a region of interest based on such data that can be carried out in two steps: (1) clustering or classification of the irregularly sampled points and (2) segmentation of the region based on the classified points.

We develop a spatially-constrained clustering algorithm for segmentation of the sample points by incorporating a geographical constraint into the standard clustering methods. Both hierarchical and nonhierarchical methods are considered. The latter is a modification of the seeded region growing method known in image analysis. Both algorithms work on a suitable neighbourhood structure, which can, for example, be defined by the Delaunay triangulation of the sample points. The number of clusters is estimated by testing the significance of successive changes in the within-cluster sum-of-squares relative to a null permutation distribution. The methodology is validated on simulated data and used in the construction of a climatology map of Ireland based on meteorological data of daily rainfall records from 1294 stations over a period of 37 years.
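A geographical constraint on hierarchical clustering can be sketched with single linkage: only pairs that are spatial neighbours may merge, and merges proceed in order of feature-space distance. This toy version uses an eps-ball neighbourhood rather than the Delaunay triangulation of the paper, and the data are illustrative assumptions:

```python
import numpy as np

def constrained_single_link(coords, feats, eps, k):
    """Single-link clustering restricted to geographically adjacent pairs
    (map distance <= eps), merging by feature distance until k clusters."""
    n = len(coords)
    parent = list(range(n))
    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted(
        (np.linalg.norm(feats[i] - feats[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
        if np.linalg.norm(coords[i] - coords[j]) <= eps
    )
    n_clusters = n
    for _, i, j in edges:
        if n_clusters == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            n_clusters -= 1
    return [find(i) for i in range(n)]

# Ten stations on a line; the first five and last five have distinct features.
coords = np.array([[float(i), 0.0] for i in range(10)])
feats = np.array([[0.0] if i < 5 else [5.0] for i in range(10)])
labels = constrained_single_link(coords, feats, eps=1.5, k=2)
print(len(set(labels)))   # → 2, split exactly between the two feature groups
```

The constraint guarantees spatially contiguous clusters, which is what makes the subsequent segmentation of the region well defined.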

11.
In extending univariate outlier detection methods to higher dimension, various issues arise: limited visualization methods, inadequacy of marginal methods, lack of a natural order, limited parametric modeling, and, when using Mahalanobis distance, restriction to ellipsoidal contours. To address and overcome such limitations, we introduce nonparametric multivariate outlier identifiers based on multivariate depth functions, which can generate contours following the shape of the data set. Also, we study masking robustness, that is, robustness against misidentification of outliers as nonoutliers. In particular, we define a masking breakdown point (MBP), adapting to our setting certain ideas of Davies and Gather [1993. The identification of multiple outliers (with discussion). Journal of the American Statistical Association 88, 782–801] and Becker and Gather [1999. The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association 94, 947–955] based on the Mahalanobis distance outlyingness. We then compare four affine invariant outlier detection procedures, based on Mahalanobis distance, halfspace or Tukey depth, projection depth, and “Mahalanobis spatial” depth. For the goal of threshold type outlier detection, it is found that the Mahalanobis distance and projection procedures are distinctly superior in performance, each with very high MBP, while the halfspace approach is quite inferior. When a moderate MBP suffices, the Mahalanobis spatial procedure is competitive in view of its contours not constrained to be elliptical and its computational burden relatively mild. A small sampling experiment yields findings completely in accordance with the theoretical comparisons. 
While these four depth procedures are relatively comparable for the purpose of robust affine equivariant location estimation, the halfspace depth is not competitive with the others for the quite different goal of robust setting of an outlyingness threshold.
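Of the four depths compared, halfspace (Tukey) depth is the least familiar computationally: it is the smallest fraction of points in any closed halfspace containing the query point. A minimal sketch approximates the minimum over all directions with random directions (an upper bound on the exact depth; data and seeds are assumptions):

```python
import numpy as np

def approx_halfspace_depth(x, X, n_dirs=500, seed=10):
    """Approximate Tukey depth: minimum, over sampled directions, of the
    fraction of points on either side of the hyperplane through x."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    proj = X @ U.T            # (n, n_dirs) projections of the data
    px = x @ U.T              # (n_dirs,) projections of the query point
    upper = (proj >= px).mean(axis=0)
    lower = (proj <= px).mean(axis=0)
    return min(upper.min(), lower.min())

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 2))
d_centre = approx_halfspace_depth(X.mean(axis=0), X)
d_far = approx_halfspace_depth(np.array([5.0, 5.0]), X)
print(d_centre > d_far)   # deep centre, depth ~0 far outside
```

This illustrates the abstract's point about thresholds: once a point leaves the convex hull, its halfspace depth is exactly zero, so the depth cannot grade *how* outlying such points are.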

12.
Although Hartigan (1975) had already put forward the idea of connecting the identification of subpopulations with high-density regions of the underlying probability distribution, the actual development of methods for cluster analysis has largely shifted towards other directions, for computational convenience. Current computational resources allow us to reconsider this formulation and to develop clustering techniques directly in order to identify local modes of the density. Given a set of observations, a nonparametric estimate of the underlying density function is constructed, and subsets of points with high density are formed through suitable manipulation of the associated Delaunay triangulation. The method is illustrated with some numerical examples.

13.
In a cluster analysis of a multivariate data set, it may happen that one or two observations have a disproportionately large effect on the analysis, in the sense that their removal causes a dramatic change to the results. It is important to be able to identify such influential observations, and the present paper addresses this problem. To do so, we must first quantify the effect of a single observation. Various definitions are discussed, and criteria for identifying influential observations are investigated; the minimum spanning tree and the number of neighbours of each observation are considered. The investigation concentrates on single-link cluster analysis, although complete-link analysis is also briefly discussed. Patterns emerge in both real and simulated data, which suggest ways of predicting observations with no effect and those with the greatest effect. It is not necessary to recalculate the results with each observation omitted—an economy of presentation as well as labour.

14.
We propose an approach that utilizes the Delaunay triangulation to identify a robust/outlier-free subsample. Given that the data structure of the non-outlying points is convex (e.g. of elliptical shape), this subsample can then be used to give a robust estimation of location and scatter (by applying the classical mean and covariance). The estimators derived from our approach are shown to have a high breakdown point. In addition, we provide a diagnostic plot to expand the initial subset in a data-driven way, further increasing the estimators’ efficiency.

15.
We consider the problem of making inferences about extreme values from a sample. The underlying model distribution is the generalized extreme-value (GEV) distribution, and our interest is in estimating the parameters and quantiles of the distribution robustly. In doing this we find estimates for the GEV parameters based on that part of the data which is well fitted by a GEV distribution. The robust procedure will assign weights between 0 and 1 to each data point. A weight near 0 indicates that the data point is not well modelled by the GEV distribution which fits the points with weights at or near 1. On the basis of these weights we are able to assess the validity of a GEV model for our data. It is important that the observations with low weights be carefully assessed to determine whether they are valid observations or not. If they are, we must examine whether our data could be generated by a mixture of GEV distributions or whether some other process is involved in generating the data. This process will require careful consideration of the subject matter area which led to the data. The robust estimation techniques are based on optimal B-robust estimates. Their performance is compared to the probability-weighted moment estimates of Hosking et al. (1985) in both simulated and real data.

16.
Structural equation models (SEM) have been extensively used in behavioral, social, and psychological research to model relations between the latent variables and the observations. Most software packages for the fitting of SEM rely on frequentist methods. Traditional models and software are not appropriate for the analysis of dependent observations such as time-series data. In this study, a structural equation model with a time series feature is introduced. A Bayesian approach is used to solve the model with the aid of the Markov chain Monte Carlo method. Bayesian inferences as well as prediction with the proposed time series structural equation model can also reveal certain unobserved relationships among the observations. The approach is successfully employed using real Asian, American and European stock return data.

17.
We study the problem of classification with multiple q-variate observations with and without time effect on each individual. We develop new classification rules for populations with certain structured and unstructured mean vectors and under certain covariance structures. The new classification rules are effective when the number of observations is not large enough to estimate the variance–covariance matrix. Computational schemes for maximum likelihood estimates of required population parameters are given. We apply our findings to two real data sets as well as to a simulated data set.

18.
Non-hierarchical clustering methods are frequently based on the idea of forming groups around 'objects'. The main exponent of this class of methods is the k-means method, where these objects are points. However, clusters in a data set may often be due to certain relationships between the measured variables. For instance, we can find linear structures such as straight lines and planes, around which the observations are grouped in a natural way. These structures are not well represented by points. We present a method that searches for linear groups in the presence of outliers. The method is based on the idea of impartial trimming. We search for the 'best' subsample containing a proportion 1 − α of the data and the best k affine subspaces fitting those non-discarded observations by measuring discrepancies through orthogonal distances. The population version of the sample problem is also considered. We prove the existence of solutions for the sample and population problems together with their consistency. A feasible algorithm for solving the sample problem is described as well. Finally, some examples showing how the proposed method works in practice are provided.

19.
Logistic regression is frequently used for classifying observations into two groups. Unfortunately there are often outlying observations in a data set and these might affect the estimated model and the associated classification error rate. In this paper, the authors study the effect of observations in the training sample on the error rate by deriving influence functions. They obtain a general expression for the influence function of the error rate, and they compute it for the maximum likelihood estimator as well as for several robust logistic discrimination procedures. Besides being of interest in their own right, the influence functions are also used to derive asymptotic classification efficiencies of different logistic discrimination rules. The authors also show how influential points can be detected by means of a diagnostic plot based on the values of the influence function.

20.
Dependent multivariate count data occur in several research studies. These data can be modelled by a multivariate Poisson or negative binomial distribution constructed using copulas. However, when some of the counts are inflated, that is, the number of observations in some cells is much larger than in other cells, then the copula-based multivariate Poisson (or negative binomial) distribution may not fit well and it is not an appropriate statistical model for the data. There is a need to modify or adjust the multivariate distribution to account for the inflated frequencies. In this article, we consider the situation where the frequencies of two cells are higher compared to the other cells and develop a doubly inflated multivariate Poisson distribution function using a multivariate Gaussian copula. We also discuss procedures for regression on covariates for the doubly inflated multivariate count data. For illustrating the proposed methodologies, we present real data containing bivariate count observations with inflations in two cells. Several models and linear predictors with log link functions are considered, and we discuss maximum likelihood estimation to estimate unknown parameters of the models.
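The generative mechanism of a doubly inflated count model can be sketched as a mixture: with probability p1 (p2) emit the first (second) inflated cell, otherwise draw from a baseline bivariate count model. The baseline below is independent Poisson for brevity, whereas the article uses a Gaussian-copula construction so the two counts can be dependent; all parameter values are illustrative assumptions:

```python
import numpy as np

def sample_doubly_inflated(n, lam=(2.0, 3.0), cells=((0, 0), (4, 4)),
                           probs=(0.1, 0.1), seed=12):
    """Draw n bivariate counts from a doubly inflated mixture:
    cells[0] w.p. probs[0], cells[1] w.p. probs[1], else baseline Poisson."""
    rng = np.random.default_rng(seed)
    out = rng.poisson(lam, size=(n, 2))   # baseline (independent Poisson)
    u = rng.random(n)
    out[u < probs[0]] = cells[0]
    out[(u >= probs[0]) & (u < probs[0] + probs[1])] = cells[1]
    return out

counts = sample_doubly_inflated(10_000)
frac_00 = np.mean((counts == (0, 0)).all(axis=1))
print(frac_00 > 0.09)   # far above the baseline Poisson probability ~0.007
```

Fitting such a model by maximum likelihood amounts to estimating the two inflation probabilities jointly with the baseline (copula) parameters, with covariates entering through log link functions as described in the abstract.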
