Similar literature
20 similar documents found.
1.
A large number of statistics are used in the literature to detect outliers and influential observations in the linear regression model. In this paper, comparison studies are carried out to determine which statistic performs better than the others. These comprise (i) a detailed simulation study and (ii) analyses of several data sets studied by different authors. Different choices of the design matrix of the regression model are considered. Design A studies the performance of the various statistics in detecting scale shift outliers, and designs B and C provide information on the performance of the statistics in identifying influential observations. Cutoff points for each statistic are obtained from the exact distributions and Bonferroni's inequality. The results show that the studentized residual, which is used for detecting mean shift outliers, is also appropriate for detecting scale shift outliers, and that Welsch's statistic and Cook's distance are appropriate for detecting influential observations.

2.
Detection of multiple unusual observations such as outliers, high leverage points and influential observations (IOs) in regression is still a challenging task for statisticians because of the well-known masking and swamping effects. In this paper we introduce a robust influence distance that can identify multiple IOs, and propose a sixfold plotting technique based on the well-known group deletion approach to classify regular observations, outliers, high leverage points and IOs simultaneously in linear regression. Experiments on several widely cited data sets and simulation studies demonstrate that the proposed algorithm performs successfully in the presence of multiple unusual observations and can avoid masking and/or swamping effects.

3.
We propose two novel diagnostic measures for the detection of influential observations for regression parameters in linear regression. Traditional diagnostic statistics focus on the effect of deleting data points either on parameter estimates or on predicted values. The new methods regard a data point as influential if its inclusion yields a significantly different likelihood function for the parameter of interest. The likelihood function concerned is asymptotically valid for practically all underlying distributions whose second moments exist.

4.
Linear models constitute the primary statistical technique for any experimental science. A major topic in this area is the detection of influential subsets of data, that is, of observations that are influential in terms of their effect on the estimation of parameters in linear regression or of the total population parameters. Numerous studies exist on radiocarbon dating which propose a value consensus and remove possible outliers after the corresponding testing. An influence analysis for the value consensus from a Bayesian perspective is developed in this article.

5.
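Release of statistical microdata and corresponding confidentiality methods   Total citations: 1 (self-citations: 0; citations by others: 1)
At present, most data agencies in China do not release the microdata they collect directly to the public. Because micro-level data cannot be used by outside parties, this amounts to a waste of social resources. We argue that the problem can be largely resolved if data agencies process the microdata with appropriate methods and then release the processed data. On the one hand, most of the information in the original data is preserved, which can meet the needs of different data users; on the other hand, the risk of disclosure is greatly reduced, which satisfies the agencies' confidentiality requirements. The main purpose of this paper is to introduce some microdata processing methods that are widely used abroad, so as to give domestic data agencies a theoretical basis and some practical procedures for releasing microdata. We offer it as a starting point, in the hope of drawing the attention of domestic data agencies and encouraging more research and innovation on this topic in the statistics community.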

6.
The detection of outliers and influential observations has received a great deal of attention in the statistical literature in the context of least-squares (LS) regression. However, the explanatory variables can be correlated with each other, and alternatives to LS have been developed to address outliers/influential observations and multicollinearity simultaneously. This paper proposes new influence measures based on the affine combination type regression for the detection of influential observations in the linear regression model when multicollinearity exists. Approximate influence measures are also proposed for the affine combination type regression. Since the affine combination type regression includes the ridge, the Liu and the shrunken regressions as special cases, influence measures under the ridge, the Liu and the shrunken regressions are also examined to see the possible effect that multicollinearity can have on the influence of an observation. The Longley data set is used to illustrate the influence measures in affine combination type regression and also in ridge, Liu and shrunken regressions, so that the performance of different biased regressions in detecting and assessing influential observations can be examined.

7.
The presence of outliers inevitably leads to distorted analysis and inappropriate prediction, especially for multiple outliers in high-dimensional regression, where the high dimensionality of the data might amplify the chance of an observation or multiple observations being outlying. Because outlier detection is both necessary and important in high-dimensional regression analysis, we propose in this paper a feasible outlier detection approach for the sparse high-dimensional linear regression model. First, we search for a clean subset using the sure independence screening method and the least trimmed squares regression estimates. Then, we define a high-dimensional outlier detection measure and propose a multiple outlier detection approach through multiple testing procedures. In addition, to enhance efficiency, we refine the outlier detection rule after obtaining a relatively reliable non-outlier subset based on the initial detection approach. Comparison studies based on Monte Carlo simulation show that the proposed method performs well in detecting multiple outliers in the sparse high-dimensional linear regression model. We further illustrate the application of the proposed method by an empirical analysis of real-life protein and gene expression data.

8.
When multiple data owners possess records on different subjects with the same set of attributes (known as horizontally partitioned data), the data owners can improve analyses by concatenating their databases. However, concatenation of data may be infeasible because of confidentiality concerns. In such settings, the data owners can use secure computation techniques to obtain the results of certain analyses on the integrated database without sharing individual records. We present secure computation protocols for Bayesian model averaging and model selection for both linear regression and probit regression. Using simulations based on genuine data, we illustrate the approach for probit regression, and show that it can provide reasonable model selection outputs.

9.
The author presents a robust F-test for comparing nested linear models. It is suggested that the approach will be attractive to practitioners because it is based on the familiar F-statistic and corresponds to the common practice of reporting F-statistics after removing obvious outliers. It is calibrated in terms of a real parameter that can be directly interpreted as the willingness of the data analyst to remove observations, and the sensitivity of the F-statistic to this parameter is easily examined. The procedure is evaluated with a simulation study where a scale mixture distribution is used to generate outliers. The procedure is also applied to some data where the occurrence of an outlier is confounded with the significance of a regression term. This provides a comparison of two competing models for the data: one removing an outlier and the other including an additional regression term instead.

10.
Five widely used test statistics for detecting outliers and influential observations were studied using the Monte Carlo method. The test statistic based on Studentized residuals, with critical values given by Tietjen, Moore and Beckman (1973), appears to be the best procedure for detecting a single outlier in simple linear regression.

11.
Outliers in multilevel data   Total citations: 2 (self-citations: 0; citations by others: 2)
This paper offers the data analyst a range of practical procedures for dealing with outliers in multilevel data. It first develops several techniques for data exploration for outliers and outlier analysis and then applies these to the detailed analysis of outliers in two large scale multilevel data sets from educational contexts. The techniques include the use of deviance reduction, measures based on residuals, leverage values, hierarchical cluster analysis and a measure called DFITS. Outlier analysis is more complex in a multilevel data set than in, say, a univariate sample or a set of regression data, where the concept of an outlying value is straightforward. In the multilevel situation one has to consider, for example, at what level or levels a particular response is outlying, and in respect of which explanatory variables; furthermore, the treatment of a particular response at one level may affect its status or the status of other units at other levels in the model.

12.
The effect of influential observations on the parameter estimates of ordinary least squares regression models has received considerable attention in the last decade. However, very little attention has been given to the problem of influential observations in the analysis of variance. The purpose of this paper is to show by way of examples that influential observations can alter the conclusions of tests of hypotheses in the analysis of variance. Regression diagnostics for identifying both extreme points and outliers can be used to reveal potential data and design problems.

13.
We present a class of truncated nonlinear regression models for location and scale where the truncated nature of the data is incorporated into the statistical model by assuming that the response variable follows a truncated distribution. The location parameter of the response variable is assumed to be modeled by a continuous nonlinear function of covariates and unknown parameters. In addition, the proposed model allows the scale parameter of the responses to be characterized by a continuous function of the covariates and unknown parameters. Three particular cases of the proposed models are presented by considering the response variable to follow a truncated normal, truncated skew normal, and truncated beta distribution. These truncated nonlinear regression models are constructed assuming fixed known truncation limits, and model parameters are estimated by direct maximization of the log-likelihood using a nonlinear optimization algorithm. Standardized residuals and diagnostic metrics based on case deletion are considered to verify the adequacy of the model and to detect outliers and influential observations. Results based on simulated data are presented to assess the frequentist properties of the estimates, and a real data set on soil-water retention from the Buriti Vermelho River Basin database is analyzed using the proposed methodology.

14.
Detection of outliers or influential observations is an important task in statistical modeling, especially for correlated time series data. In this paper we propose a new procedure to detect a patch of influential observations in the generalized autoregressive conditional heteroskedasticity (GARCH) model. First, we compare the performance of the innovative perturbation scheme, the additive perturbation scheme and the data perturbation scheme in local influence analysis. We find that the innovative perturbation scheme gives better results than the other two schemes, although it may suffer from masking effects. We then use the stepwise local influence method under the innovative perturbation scheme to detect a patch of influential observations and uncover the masking effects. The simulation studies show that the new technique can successfully detect a patch of influential observations or outliers under the innovative perturbation scheme. The analysis based on simulation studies and two real data sets shows that the stepwise local influence method under the innovative perturbation scheme is efficient for detecting multiple influential observations and dealing with masking effects in the GARCH model.

15.
An outlier is defined as an observation that is significantly different from the others in its dataset. In high-dimensional regression analysis, datasets often contain a portion of outliers. It is important to identify and eliminate the outliers when fitting a model to a dataset. In this paper, a novel outlier detection method is proposed for high-dimensional regression problems. The leave-one-out idea is used to construct a novel outlier detection measure based on distance correlation, and an outlier detection procedure is then proposed. The proposed method enjoys several advantages. First, the outlier detection measure can be calculated simply, and the detection procedure works efficiently even for high-dimensional regression data. Moreover, it can deal with a general regression, which does not require specification of a linear regression model. Finally, simulation studies show that the proposed method behaves well in detecting outliers in high-dimensional regression models and performs better than some competing methods.

16.
17.
The author develops a robust quasi-likelihood method, which appears to be useful for down-weighting any influential data points when estimating the model parameters. He illustrates the computational issues of the method in an example. He uses simulations to study the behaviour of the robust estimates when data are contaminated with outliers, and he compares these estimates to those obtained by the ordinary quasi-likelihood method.

18.
The investigation of the identification of outliers in linear regression models can be extended to the circular regression case. In this paper, we propose a new numerical statistic called the mean circular error to identify possible outliers in circular regression models by using a row deletion approach. Through intensive simulation studies, the cut-off points of the statistic are obtained and its power performance is investigated. It is found that the performance improves as the concentration parameter of the circular residuals becomes larger or the sample size becomes smaller. As an illustration, the statistic is applied to a wind direction data set.

19.
In this paper, we consider effective Bayesian inference for the censored Student-t linear regression model, which is a robust alternative to the usual censored Normal linear regression model. Based on the mixture representation of the Student-t distribution, we propose a non-iterative Bayesian sampling procedure to obtain approximately independent and identically distributed samples from the observed posterior distributions, in contrast to the iterative Markov Chain Monte Carlo algorithm. We conduct model selection and influence analysis using the posterior samples to choose the best fitted model and to detect latent outliers. We illustrate the performance of the procedure through simulation studies. Finally, we apply the procedure to two real data sets, the insulation life data with right censoring and the wage rates data with left censoring, and obtain some interesting results.

20.
Leverage values are used in regression diagnostics as measures of influential observations in the $X$-space. Detection of high leverage values is crucial because they can lead to misleading conclusions about the fit of a regression model, cause multicollinearity problems, and mask and/or swamp outliers. Much work has been done on the identification of single high leverage points, and it is generally believed that the problem of detecting a single high leverage point has been largely resolved. But there is no general agreement among statisticians about the detection of multiple high leverage points. When a group of high leverage points is present in a data set, the commonly used diagnostic methods fail to identify them correctly, mainly because of the masking and/or swamping effects. On the other hand, the robust alternative methods can identify the high leverage points correctly, but they tend to flag too many low leverage points as high leverage points, which is also undesirable. An attempt has been made to find a compromise between these two approaches. We propose an adaptive method in which the suspected high leverage points are identified by robust methods and the low leverage points (if any) are then put back into the estimation data set after diagnostic checking. The usefulness of the proposed method for detecting multiple high leverage points is examined using some well-known data sets and Monte Carlo simulations.
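
Several of the abstracts listed above refer to leverage values, studentized residuals and Cook's distance without defining them. The following is a minimal sketch of those three classical single-case diagnostics for simple linear regression, written in plain Python. It is not taken from any of the listed papers: the function name `simple_regression_diagnostics` and the demo data are assumptions made for illustration, and it implements only the textbook deletion formulas, not the robust or group-deletion methods the abstracts propose.

```python
import math


def simple_regression_diagnostics(x, y):
    """Return (leverage, studentized residual, Cook's distance) per case.

    Hypothetical helper for the model y = b0 + b1 * x + error, fitted by
    ordinary least squares. Assumes at least four cases, non-constant x,
    and data that do not lie exactly on a line once a case is removed.
    """
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    s2 = sse / (n - p)  # residual variance using all cases

    rows = []
    for xi, e in zip(x, resid):
        # Leverage: diagonal element of the hat matrix.
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx
        # Residual variance with this case deleted (no refit needed).
        s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
        # Externally studentized (deletion) residual.
        t = e / math.sqrt(s2_deleted * (1.0 - h))
        # Cook's distance from the internally studentized residual.
        r_squared = e ** 2 / (s2 * (1.0 - h))
        cook = r_squared * h / (p * (1.0 - h))
        rows.append((h, t, cook))
    return rows


# Demo data (made up): case 3 is off the line, case 8 is far out in x.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y_demo = [3.1, 4.9, 7.2, 15.0, 10.8, 13.1, 15.2, 16.9, 41.0]
for i, (h, t, cook) in enumerate(simple_regression_diagnostics(x_demo, y_demo)):
    print(f"case {i}: leverage={h:.3f} t={t:.2f} cook={cook:.3f}")
```

In this made-up data set the case far out in $X$ has the largest leverage but follows the line, while the case with the shifted response has the largest studentized residual. This is the distinction between high leverage points and outliers that several of the abstracts draw.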
