Similar literature
 20 similar documents found (search time: 46 ms)
1.
The presence of outliers inevitably leads to distorted analysis and inappropriate prediction, especially for multiple outliers in high-dimensional regression, where the high dimensionality of the data might amplify the chance of one or more observations being outlying. Noting that the detection of outliers is not only necessary but also important in high-dimensional regression analysis, we propose in this paper a feasible outlier detection approach for sparse high-dimensional linear regression models. First, we search for a clean subset using the sure independence screening method and least trimmed squares regression estimates. Then, we define a high-dimensional outlier detection measure and propose a multiple-outlier detection approach through multiple testing procedures. In addition, to enhance efficiency, we refine the outlier detection rule after obtaining a relatively reliable non-outlier subset from the initial detection approach. Comparison studies based on Monte Carlo simulation show that the proposed method performs well for detecting multiple outliers in sparse high-dimensional linear regression models. We further illustrate the application of the proposed method by an empirical analysis of real-life protein and gene expression data.
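
The screening step named in this abstract can be illustrated with a minimal sketch of sure independence screening, which ranks predictors by the absolute marginal correlation with the response and keeps the top `d`. The function name `sis_screen`, the simulated data, and the choice `d=10` are assumptions made for illustration; the abstract does not give the authors' settings, and the least trimmed squares and testing steps are not shown.

```python
import numpy as np

def sis_screen(X, y, d):
    """Return indices of the d predictors with the largest absolute
    marginal sample correlation with y (sure independence screening)."""
    Xc = X - X.mean(axis=0)          # centre each predictor column
    yc = y - y.mean()                # centre the response
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = Xc.T @ yc / denom         # column-wise correlation with y
    return np.argsort(-np.abs(corr))[:d]

# Hypothetical sparse model: only the first two of 500 predictors matter.
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

keep = sis_screen(X, y, d=10)        # indices of the retained predictors
print(sorted(keep.tolist()))
```

Screening reduces the problem to `d` predictors, on which a low-dimensional robust fit becomes feasible.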

2.
An outlier is defined as an observation that is significantly different from the others in its dataset. In high-dimensional regression analysis, datasets often contain a portion of outliers. It is important to identify and eliminate the outliers when fitting a model to a dataset. In this paper, a novel outlier detection method is proposed for high-dimensional regression problems. The leave-one-out idea is used to construct a novel outlier detection measure based on distance correlation, and then an outlier detection procedure is proposed. The proposed method enjoys several advantages. First, the outlier detection measure can be calculated simply, and the detection procedure works efficiently even for high-dimensional regression data. Moreover, it can deal with a general regression, which does not require specification of a linear regression model. Finally, simulation studies show that the proposed method behaves well for detecting outliers in high-dimensional regression models and performs better than some competing methods.
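
As a rough illustration of the leave-one-out idea described in this abstract, the sketch below computes the sample distance correlation (the V-statistic form due to Székely and co-authors) and records how it changes when each observation is removed. The score `loo_dcor_change` is a hypothetical stand-in: the abstract does not state the exact measure or the detection threshold, so neither is reproduced here.

```python
import numpy as np

def _centered_dist(Z):
    """Double-centred matrix of pairwise Euclidean distances."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())

def distance_correlation(X, y):
    """Sample distance correlation between X (n x p or length n) and y."""
    A, B = _centered_dist(X), _centered_dist(y)
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    if dvar_x * dvar_y == 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar_x * dvar_y)))

def loo_dcor_change(X, y):
    """Hypothetical score: change in distance correlation when
    observation i is left out (one value per observation)."""
    n = len(y)
    full = distance_correlation(X, y)
    idx = np.arange(n)
    return np.array([distance_correlation(X[idx != i], y[idx != i]) - full
                     for i in range(n)])

# Hypothetical data with p > n and one shifted response value.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 100))
y = X[:, 0] + 0.1 * rng.standard_normal(60)
y[0] += 10.0
scores = loo_dcor_change(X, y)
```

No linear model is fitted at any point, which matches the abstract's remark that the approach does not require specifying one.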

3.
In this paper we present a "model-free" method of outlier detection for Gaussian time series that uses the autocorrelation structure of the series. We also present a graphical diagnostic method to distinguish an additive outlier (AO) from an innovation outlier (IO). The test statistic for detecting the outlier has a χ² distribution with one degree of freedom. We show that this method works well when the time series contains either one type of outlier or both additive and innovation outliers, and that it has the advantage that no time series model needs to be estimated from the data. Simulation evidence shows that different types of outliers can be distinguished graphically using the proposed techniques.

4.
In this paper, we propose a novel robust principal component analysis (PCA) for high-dimensional data in the presence of various heterogeneities, in particular heavy tails and outliers. A transformation motivated by the characteristic function is constructed to improve the robustness of the classical PCA. The suggested method has the distinct advantage of dealing with heavy-tailed data, whose covariances may be non-existent (positively infinite, for instance), in addition to the usual outliers. The proposed approach is also a case of kernel principal component analysis (KPCA) and gains its robust and non-linear properties from a bounded, non-linear kernel function. The merits of the new method are illustrated by some statistical properties, including the upper bound of the excess error and the behaviour of the large eigenvalues under a spiked covariance model. Additionally, using a variety of simulations, we demonstrate the benefits of our approach over the classical PCA. Finally, using data on protein expression in mice of various genotypes in a biological study, we apply the novel robust PCA to categorise the mice and find that our approach is more effective at identifying abnormal mice than the classical PCA.

5.
Outlier detection is fundamental to statistical modelling. When there are multiple outliers, many traditional approaches in use are stepwise detection procedures, which can be computationally expensive and ignore stochastic error in the outlier detection process. Outlier detection can be performed by a heteroskedasticity test. In this article, a rapid outlier detection method via a multiple heteroskedasticity test based on penalized likelihood approaches is proposed to handle these kinds of problems. The proposed method detects the heteroskedasticity of all data in a single step and estimates the coefficients simultaneously. The proposed approach is distinguished from others in that it uses a weighted least squares formulation coupled with nonconvex sparsity-inducing penalization. Furthermore, the proposed approach does not need to construct test statistics and calculate their distributions. A new algorithm is proposed for optimizing penalized likelihood functions. Favourable theoretical properties of the proposed approach are obtained. Our simulation studies and real data analysis show that the newly proposed methods compare favourably with other traditional outlier detection techniques.

6.
This article studies the outlier detection problem in the mixed regressive-spatial autoregressive model. The formulae for testing outliers and their approximate distributions are derived under the mean-shift model and the variance-weight model, respectively. Simulation studies are conducted to examine the power and size of the test, as well as the detection of outliers when a simulated dataset contains several outliers. A real dataset is analyzed to illustrate the proposed method, and modified models based on the mean-shift and variance-weight models, in which detected outliers are taken into account, are suggested to deal with the outliers and confirm the conclusions.

7.
Multivariate mixture regression models can be used to investigate the relationships between two or more response variables and a set of predictor variables by taking into consideration unobserved population heterogeneity. It is common to take multivariate normal distributions as mixing components, but this mixing model is sensitive to heavy-tailed errors and outliers. Although normal mixture models can approximate any distribution in principle, the number of components needed to account for heavy-tailed distributions can be very large. Mixture regression models based on multivariate t distributions can be considered a robust alternative. Missing data are inevitable in many situations, and parameter estimates could be biased if the missing values are not handled properly. In this paper, we propose a multivariate t mixture regression model with missing information to model heterogeneity in the regression function in the presence of outliers and missing values. Along with robust parameter estimation, our proposed method can be used for (i) visualization of the partial correlation between response variables across latent classes and heterogeneous regressions, and (ii) outlier detection and robust clustering even in the presence of missing values. We also propose a multivariate t mixture regression model using MM-estimation with missing information that is robust to high-leverage outliers. The proposed methodologies are illustrated through simulation studies and real data analysis.

8.
The geometric mean (GM) is finding growing and wider application in statistical data analysis as a measure of central tendency. It is generally believed that the GM is less sensitive to outliers than the arithmetic mean (AM), but we suspect that, like the AM, the GM may also suffer a huge setback in the presence of outliers, especially when multiple outliers occur in a dataset. As far as we know, not much work has been done on the robustness of the GM. In quest of a simple robust measure of central tendency, we propose the geometric median (GMed) in this paper. We show that the classical GM has a breakdown point of only 0%, while the proposed GMed has one of 50%. Numerical examples also support our claim that the proposed GMed is unaffected by multiple outliers and can maintain the highest possible breakdown point of 50%. We then develop a new method for identifying multiple outliers based on the proposed GMed. A variety of numerical examples show that the proposed method can successfully identify all potential outliers while the traditional GM fails to do so.

9.
Despite the popularity of high-dimension, low-sample-size data analysis, not enough attention has been paid to the sample integrity issue, in particular the possibility of outliers in the data. A new outlier detection procedure for data with much larger dimensionality than sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented, with consideration of the implications of the outliers found.

10.
In this article, we propose an outlier detection approach for a multiple regression model using the properties of a difference-based variance estimator. This type of difference-based variance estimator was originally used to estimate the error variance in a nonparametric regression model without estimating the nonparametric function. This article is the first to employ a difference-based error variance estimator to study the outlier detection problem in a multiple regression model. Our approach uses a leave-one-out type method based on the difference-based error variance. Existing outlier detection approaches that use leave-one-out are highly affected by other outliers, while ours is not, because our approach does not use the regression coefficient estimator. We compared our approach with several existing methods in a simulation study, which suggests that our approach outperforms them. The advantages of our approach are demonstrated using a real data application. Our approach can be extended to the nonparametric regression model for outlier detection.
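
To make the idea of a difference-based variance estimator concrete, here is a minimal sketch using the classical first-order estimator of Rice for a single covariate, where the observations are sorted by x and the error variance is estimated as sum((y[i+1] - y[i])^2) / (2(n - 1)) without fitting the regression function. The abstract does not say which difference-based estimator the authors use or how observations are ordered in the multiple regression setting, so the one-covariate setup, the function names, and the leave-one-out score below are assumptions for illustration only.

```python
import numpy as np

def rice_variance(x, y):
    """First-order difference-based estimate of the error variance:
    sort by x, then sum squared successive differences / (2 (n - 1))."""
    order = np.argsort(x)
    d = np.diff(np.asarray(y, dtype=float)[order])
    return np.sum(d ** 2) / (2 * (len(y) - 1))

def loo_variance(x, y):
    """Hypothetical leave-one-out score: the variance estimate obtained
    after deleting each observation in turn."""
    n = len(y)
    masks = ~np.eye(n, dtype=bool)       # row i drops observation i
    return np.array([rice_variance(x[m], y[m]) for m in masks])

# Hypothetical smooth signal with one planted outlier at index 50.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)
y[50] += 5.0

loo = loo_variance(x, y)
print(int(np.argmin(loo)))               # deletion that shrinks the estimate most
```

Because no regression coefficients are estimated, the score for one observation is not distorted by a fit that other outliers have already pulled off course, which is the property the abstract emphasizes.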

11.
Fuzzy least-squares regression can be very sensitive to unusual data (e.g., outliers). In this article, we describe how to fit an alternative robust regression estimator in a fuzzy environment, one that attempts to identify and ignore unusual data. The proposed approach draws on classical robust regression and estimation methods that are insensitive to outliers. In this regard, based on the least trimmed squares estimation method, an estimation procedure is proposed for determining the coefficients of the fuzzy regression model for crisp-input, fuzzy-output data. The investigated fuzzy regression model is applied to bedload transport data, forecasting suspended load from discharge based on real-world data. The accuracy of the proposed method is compared with the well-known fuzzy least-squares regression model. The comparison, which is based on a similarity measure between fuzzy sets, reveals that the fuzzy robust regression model performs better than the other models in suspended load estimation for this particular dataset. The proposed model is general and can be used for modeling natural phenomena whose available observations are reported as imprecise rather than crisp.

12.
This paper is concerned with conditional feature screening for ultra-high dimensional right-censored data when some important predictors have been identified previously. A new model-free conditional feature screening approach, conditional correlation rank sure independence screening, is proposed and investigated theoretically. The suggested conditional screening procedure has several desirable merits. First, it is model free, and thus robust to model misspecification. Second, it is robust to heavy-tailed distributions of the response and to the presence of potential outliers in the response. Third, it is naturally applicable to complete data when there is no censoring. Through simulation studies, we demonstrate that the proposed approach outperforms the CoxCS of Hong et al. under some circumstances. A real dataset is used to illustrate the usefulness of the proposed conditional screening method.

13.
The Bayesian analysis of outliers using a non-informative prior for the parameters is non-trivial because models with different numbers of outliers have different dimensions. A quasi-Bayesian approach based on Akaike's predictive likelihood is proposed for the analysis of regression outliers. It overcomes the dimensionality problem in Bayesian outlier analysis: the likelihood of the outlier model is compensated by a correction factor adjusted for the number of outliers. The stack loss data set is analysed with satisfactory results.

14.
Asymmetric models have been discussed quite extensively in recent years, in situations where the normality assumption is suspect due to lack of symmetry in the data. Techniques for assessing the quality of fit and diagnostic analysis are important for model validation. This paper presents a study of the mean-shift method for the detection of outliers in regression models under skew scale-mixtures of normal distributions. Analytical solutions for the estimators of the parameters are obtained through the use of the Expectation–Maximization algorithm. The observed information matrix for the calculation of standard errors is obtained for each distribution. Simulation studies and an application to the analysis of a dataset have been carried out, showing the efficiency of the proposed method in detecting outliers.

15.
In this article, we investigate a new estimation approach for the partially linear single-index model based on the modal regression method, where the nonparametric function is estimated by the penalized spline method. Moreover, we develop an expectation-maximization (EM)-type algorithm and establish the large-sample properties of the proposed estimation method. A distinguishing characteristic of the newly proposed estimator is its robustness against outliers, achieved by introducing an additional tuning parameter that can be selected automatically from the observed data. Simulation studies and a real data example are used to evaluate the finite-sample performance, and the results show that the newly proposed method works very well.

16.
The problem of outliers in statistical data has attracted many researchers for a long time. Consequently, numerous outlier detection methods have been proposed in the statistical literature. However, no consensus has emerged as to which method is uniformly better than the others or which one is recommended for use in practical situations. In this article, we perform an extensive comparative Monte Carlo simulation study to assess the performance of the multiple outlier detection methods that are either recently proposed or frequently cited in the outlier detection literature. Our simulation experiments include a wide variety of realistic and challenging regression scenarios. We give recommendations on which method is superior to others under what conditions.

17.
Many methods have been developed for detecting multiple outliers in a single multivariate sample, but very few for the case where there may be groups in the data. We propose a method of simultaneously determining groups (as in cluster analysis) and detecting outliers, which are points that are distant from every group. Our method is an adaptation of the BACON algorithm proposed by Billor, Hadi and Velleman for the robust detection of multiple outliers in a single group of multivariate data. There are two versions of our method, depending on whether or not the groups can be assumed to have equal covariance matrices. The effectiveness of the method is illustrated by its application to two real data sets and further shown by a simulation study for different sample sizes and dimensions for 2 and 3 groups, with and without planted outliers in the data. When the number of groups is not known in advance, the algorithm could be used as a robust method of cluster analysis, by running it for various numbers of groups and choosing the best solution.

18.
In this article, we discuss the estimation of the parameter function for a functional logistic regression model in the presence of outliers. We consider ways of making the parameter estimator resistant to outliers while also minimizing multicollinearity and reducing the high dimensionality that is inherent in functional data. To achieve this, the functional covariates and functional parameter of the model are approximated in a finite-dimensional space generated by an appropriate basis. This approach reduces the functional model to a standard multiple logistic model with highly collinear covariates and potential high-dimensionality issues. The proposed estimator tackles these issues and also minimizes the effect of functional outliers. Results from a simulation study and a real-world example are also presented to illustrate the performance of the proposed estimator.

19.
In this article, using a scale mixture of skew-normal distributions in which the mixing random variable is assumed to follow a mixture model with varying weights for each observation, we introduce a generalization of the skew-normal linear regression model with the aim of providing resistant results. This model, which also includes the skew-slash distribution as a particular case, allows us to accommodate and detect outlying observations under the skew-normal linear regression model. Inferences about the model are carried out through the empirical Bayes approach. The conditions for propriety of the posterior and for existence of posterior moments are given under the standard noninformative priors for the regression and scale parameters as well as a proper prior for the skewness parameter. Then, for Bayesian inference, a Markov chain Monte Carlo method is described. Since posterior results depend on the prior hyperparameters, we estimate them by adopting the empirical Bayes method as well as by using a Monte Carlo EM algorithm. Furthermore, to identify possible outliers, we also apply the Bayes factor obtained through the generalized Savage-Dickey density ratio. Examining the proposed approach on a simulated instance and real data, we find that it not only provides satisfactory parameter estimates but also identifies outliers favorably.

20.
The estimation of the covariance matrix is important in the analysis of bivariate longitudinal data. A good estimator for the covariance matrix can improve the efficiency of the estimators of the mean regression coefficients. Furthermore, the covariance estimation itself is also of interest, but it is challenging to model the covariance matrix of bivariate longitudinal data because of its complex structure and the positive-definite constraint. In addition, most existing approaches are based on maximum likelihood, which is very sensitive to outliers or heavy-tailed error distributions. In this article, an adaptive robust estimation method is proposed for bivariate longitudinal data. Unlike the existing likelihood-based methods, the proposed method can adapt to different error distributions. Specifically, we first use the modified Cholesky block decomposition to parameterize the covariance matrices. Second, we apply the bounded Huber score function to develop a set of robust generalized estimating equations that estimate the parameters in the mean and covariance models simultaneously. A data-driven approach is presented to select the parameter c in the Huber score function, which ensures that the proposed method is both robust and efficient. A simulation study and a real data analysis are conducted to illustrate the robustness and efficiency of the proposed approach.
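
The role of the bounded Huber score function and its parameter c can be seen in a much simpler setting than the robust generalized estimating equations of this abstract. The sketch below defines the score as a residual clipped to [-c, c] and uses it in a plain location estimate with a fixed MAD-based scale. The location problem, the function names, and the default `c=1.345` (a conventional choice for Gaussian errors, not the data-driven selection the abstract describes) are all assumptions made for illustration.

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber score function: identity on [-c, c], constant beyond."""
    return np.clip(r, -c, c)

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Solve mean(psi((x - mu) / s)) = 0 for mu by fixed-point
    iteration, with the scale s fixed at the normalised MAD."""
    x = np.asarray(x, dtype=float)
    mu = float(np.median(x))
    scale = np.median(np.abs(x - mu)) / 0.6745
    if scale == 0:
        return mu
    for _ in range(max_iter):
        step = scale * huber_psi((x - mu) / scale, c).mean()
        mu += step
        if abs(step) < tol:
            break
    return mu

# Hypothetical sample: five values near zero and one gross outlier.
data = np.array([0.0, 0.1, -0.1, 0.2, -0.2, 100.0])
mu_hat = huber_location(data)
print(np.mean(data), mu_hat)   # the clipped score bounds the outlier's pull
```

A smaller c clips more residuals and gives more robustness at the cost of efficiency under clean Gaussian errors, and a larger c does the opposite, which is why the choice of c matters.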
