Similar articles
 Found 20 similar articles; search took 0 ms
1.
ABSTRACT

In high-dimensional regression, influential observations may lead to inaccurate results, so detecting these unusual points before the regression analysis is an important issue. Most traditional approaches, however, are based on single-case diagnostics, and they may fail when multiple influential observations are present and subject to masking effects. In this paper, an adaptive multiple-case deletion approach is proposed for detecting multiple influential observations in the presence of masking effects in high-dimensional regression. The procedure has two stages. In the first stage, we propose a multiple-case deletion technique and obtain an approximately clean subset of the data that is presumably free of influential observations. In the second stage, we refine the detection rule to enhance efficiency. Monte Carlo simulation studies and a real-life data analysis illustrate the performance of the proposed procedure.

2.
Kupper and Meydrech, and Myers and Lahoda, introduced the mean squared error (MSE) approach to study response surface designs; Duncan and DeGroot derived a criterion for optimality of linear experimental designs based on minimum mean squared error. However, minimization of the MSE of an estimator may require some knowledge about the unknown parameters. Without such knowledge, constructing designs that are optimal in the sense of MSE may not be possible. In this article, a simple method of selecting the levels of regressor variables suitable for estimating some functions of the parameters of a lognormal regression model is developed, using a criterion for optimality based on the variance of an estimator. For some special parametric functions, the criterion used here is equivalent to the criterion of minimizing the mean squared error. It is found that the maximum likelihood estimators of a class of parametric functions can be improved substantially (in the sense of MSE) by proper choice of the values of regressor variables. Moreover, our approach is applicable to analysis of variance as well as regression designs.

3.
4.
We consider the problem of estimation of a density function in the presence of incomplete data and study the Hellinger distance between our proposed estimators and the true density function. Here, the presence of incomplete data is handled by utilizing a Horvitz–Thompson-type inverse weighting approach, where the weights are the estimates of the unknown selection probabilities. We also address the problem of estimation of a regression function with incomplete data.
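
The inverse weighting idea can be illustrated with a small sketch. It assumes a Gaussian kernel, a fixed bandwidth, and selection probabilities that are already available; the abstract specifies none of these, and the function and variable names are hypothetical, so this is not the paper's estimator.

```python
import math

def ht_kernel_density(grid, values, observed, probs, bandwidth):
    """Inverse-probability-weighted Gaussian kernel density estimate on `grid`.

    values[i] may be None when observed[i] is 0; probs[i] is the (estimated)
    probability that case i is observed. All names here are illustrative.
    """
    n = len(values)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    density = []
    for g in grid:
        total = 0.0
        for v, d, p in zip(values, observed, probs):
            if d:  # only observed cases contribute, each weighted by 1 / p
                u = (g - v) / bandwidth
                total += math.exp(-0.5 * u * u) / p
        density.append(total / norm)
    return density

# Toy data: two of six cases are missing (assumed values, for illustration).
values = [0.3, None, 1.2, 2.5, None, 1.9]
observed = [1, 0, 1, 1, 0, 1]
probs = [0.8, 0.5, 0.8, 0.5, 0.5, 0.8]
grid = [i * 0.01 - 10.0 for i in range(2301)]  # evenly spaced from -10 to 13
estimate = ht_kernel_density(grid, values, observed, probs, bandwidth=0.5)
```

Dividing by n rather than by the sum of the weights keeps the Horvitz–Thompson form; the estimate then integrates to the average weight, which is close to 1 only when the selection probabilities are well estimated.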

5.
Several methods have been suggested in the literature to detect influential observations in data fitted by the usual linear model y = Xβ + ε, ε ~ N(0, σ²I). Recently, Chatterjee & Hadi (1986) have reviewed most of these available methods and described the inter-relationships between them. In this article, we extend some of these methods to the case of multivariate regression data. We consider several data sets to illustrate the methods.

6.
The identification of influential observations has drawn a great deal of attention in regression diagnostics. Most of these identification techniques are based on single-case deletion, and among them DFFITS has become very popular with statisticians. But this technique, along with all other single-case diagnostics, may be ineffective in the presence of multiple influential observations. In this paper we develop a generalized version of DFFITS based on group deletion and then propose a new technique to identify multiple influential observations using it. The advantage of using the proposed method in the identification of multiple influential cases is then investigated through several frequently cited data sets.
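
For orientation, the classical single-case DFFITS that the abstract takes as its starting point can be sketched as follows. This is not the generalized group-deletion version, which the abstract does not spell out. The sketch assumes a simple regression with an intercept and one predictor, toy data, and the common rule-of-thumb cutoff 2·sqrt(p/n); all three are assumptions made here.

```python
import math

def dffits(x, y):
    """Single-case DFFITS for the simple regression y = b0 + b1 * x."""
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    out = []
    for e, xi in zip(resid, x):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this case
        # Residual variance with this case deleted, without refitting.
        s2_deleted = (sse - e * e / (1.0 - h)) / (n - p - 1)
        t = e / math.sqrt(s2_deleted * (1.0 - h))  # externally studentized residual
        out.append(t * math.sqrt(h / (1.0 - h)))
    return out

# Toy data (assumed): the last case departs from the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 12.0]
values = dffits(x, y)
cutoff = 2.0 * math.sqrt(2 / len(x))  # rule-of-thumb threshold, not from the abstract
flagged = [i for i, v in enumerate(values) if abs(v) > cutoff]
```

Because each value is computed by deleting one case at a time, two or more influential cases that sit close together can hide one another, which is the masking problem that motivates group deletion.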

7.
In a regression or classification setting where we wish to predict Y from x1,x2,..., xp, we suppose that an additional set of coaching variables z1,z2,..., zm are available in our training sample. These might be variables that are difficult to measure, and they will not be available when we predict Y from x1,x2,..., xp in the future. We consider two methods of making use of the coaching variables in order to improve the prediction of Y from x1,x2,..., xp. The relative merits of these approaches are discussed and compared in a number of examples.

8.
The identification of influential observations in logistic regression has drawn a great deal of attention in recent years. Most of the available techniques, such as Cook's distance and difference of fits (DFFITS), are based on single-case deletion. But there is evidence that these techniques suffer from masking and swamping problems and consequently fail to detect multiple influential observations. In this paper, we develop a new measure for the identification of multiple influential observations in logistic regression based on a generalized version of DFFITS. The advantage of the proposed method is then investigated through several frequently cited data sets and a simulation study.

9.
In this paper we investigate under which conditions it is preferable to use proxies or to omit variables from the linear regression model with respect to the matrix mean square error criterion. Furthermore, some attention is paid to the admissibility of the proxies-based least squares estimator.

10.
This note considers a method for estimating regression parameters from data containing measurement errors, using some natural estimates of the unobserved explanatory variables. It is shown that the resulting estimator is consistent not only in the usual linear regression model but also in the probit model and in regression models with censorship or truncation. However, it fails to be consistent in nonlinear regression models except for special cases.

11.
Linear regression with compositional explanatory variables   Cited by: 1 (self-citations: 0, citations by others: 1)
Compositional explanatory variables should not be used directly in a linear regression model because any inference statistic can become misleading. While various approaches to this problem have been proposed, here an approach based on the isometric logratio (ilr) transformation is used. It turns out that the resulting model is easy to handle and that parameter estimation can be done as in ordinary linear regression. Moreover, it is possible to use the ilr variables for inference statistics in order to obtain an appropriate interpretation of the model.
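
As a rough illustration of the transformation step, the sketch below maps a composition with D strictly positive parts to D - 1 real coordinates. It uses one common choice of ilr basis (often called pivot coordinates); the abstract does not say which basis the paper uses, so that choice and the function name are assumptions.

```python
import math

def ilr(parts):
    """ilr coordinates (pivot-coordinate basis) of a strictly positive composition."""
    d = len(parts)
    logs = [math.log(v) for v in parts]
    coords = []
    for i in range(d - 1):
        rest = d - i - 1  # number of parts that come after position i
        mean_log_rest = sum(logs[i + 1:]) / rest  # log of their geometric mean
        coords.append(math.sqrt(rest / (rest + 1)) * (logs[i] - mean_log_rest))
    return coords

# A three-part composition becomes two unconstrained coordinates, which can
# then enter an ordinary least squares fit as regular explanatory variables.
z = ilr([0.2, 0.3, 0.5])
```

The coordinates depend only on ratios between parts, so rescaling the composition (for example from proportions to percentages) leaves them unchanged.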

12.
This note considers a method for estimating regression parameters from data containing measurement errors, using some natural estimates of the unobserved explanatory variables. It is shown that the resulting estimator is consistent not only in the usual linear regression model but also in the probit model and in regression models with censorship or truncation. However, it fails to be consistent in nonlinear regression models except for special cases.

13.
A method for detecting outliers in axial data was proposed by Best and Fisher (1986) [Best, D.J., Fisher, N.I. (1986). Goodness-of-fit and discordancy tests for samples from the Watson distribution on the sphere. Aust. J. Stat. 28:1331]. To extend that work, we propose four new methods. Two of them are suitable for outlier detection and depend on the classic geodesic distance and a modified version of it. The other two, designed for detecting influential observations, are based on the Kullback–Leibler (KL) and Cook’s distances. Simulation experiments compare all the methods considered, using detection and error rates as criteria. The numerical results favor the KL distance.

14.
A methodology is developed for selecting the order of an ARMA representation of a short realization. The methodology is based on an extension of the Instrumental Variables technique, and its theoretical logic is supported by the characteristics of extended Yule-Walker equations and Toeplitz matrices. The methodology is a modification of the Corner Method and tries to identify a set of orders instead of a single order. The strength of the methodology is evaluated by comparing its numerical findings with those from the Corner Method and the Extended Sample Autocorrelation Function Method. The numerical results imply that (i) the proposed method performs, on average, better than the Corner Method, and both methods outperform the Extended Sample Autocorrelation Function Method, and (ii) the selection of a set of orders provides more reliable results than the selection of a single order.

15.
Let (θ, X) be a random vector such that E(X|θ) = θ and Var(X|θ) = a + bθ + cθ² for some known constants a, b and c. Assume X1,…,Xn are independent observations which have the same distribution as X. Let t(X) be the linear regression of θ on X. The linear empirical Bayes estimator is used to approximate the linear regression function. It is shown that under appropriate conditions, the linear empirical Bayes estimator approximates the linear regression well in the sense of mean squared error.
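
The form of the linear regression t(X) follows from the two moment conditions stated in the abstract. As a sketch, write θ for the unobserved parameter, μ = E(θ) and τ² = Var(θ) (symbols introduced here, not in the abstract), and assume 1 + c ≠ 0:

```latex
\begin{align*}
\operatorname{Cov}(\theta, X)
  &= \mathrm{E}\bigl[\theta\,\mathrm{E}(X \mid \theta)\bigr] - \mu^2 = \tau^2, \\
\operatorname{Var}(X)
  &= \mathrm{E}\bigl[\operatorname{Var}(X \mid \theta)\bigr]
   + \operatorname{Var}\bigl[\mathrm{E}(X \mid \theta)\bigr]
   = a + b\mu + c(\tau^2 + \mu^2) + \tau^2, \\
t(X)
  &= \mu + \frac{\tau^2}{\operatorname{Var}(X)}\,(X - \mu),
\qquad
\tau^2 = \frac{\operatorname{Var}(X) - a - b\mu - c\mu^2}{1 + c}.
\end{align*}
```

Since E(X) = μ, replacing μ and Var(X) by the sample mean and sample variance of X1,…,Xn gives one natural plug-in version of t(X); whether this coincides exactly with the estimator studied in the paper is an assumption.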

16.
A polynomial functional relationship with errors in both variables can be consistently estimated by constructing an ordinary least squares estimator for the regression coefficients, assuming hypothetically the latent true regressor variable to be known, and then adjusting for the errors. If normality of the error variables can be assumed, the estimator can be simplified considerably. Only the variance of the errors in the regressor variable and its covariance with the errors of the response variable need to be known. If the variance of the errors in the dependent variable is also known, another estimator can be constructed.

17.
Interval-valued variables have become very common in data analysis. Up until now, symbolic regression has mostly approached this type of data from an optimization point of view, considering neither the probabilistic aspects of the models nor the nonlinear relationships between the interval response and the interval predictors. In this article, we formulate interval-valued variables as bivariate random vectors and introduce the bivariate symbolic regression model based on generalized linear models theory, which provides much-needed flexibility in practice. Important inferential aspects are investigated. Applications to synthetic and real data illustrate the usefulness of the proposed approach.

18.
In this paper, two new multiple influential observation detection methods, GCD.GSPR and mCD*, are introduced for logistic regression. The proposed diagnostic measures are compared with the generalized difference in fits (GDFFITS) and the generalized squared difference in beta (GSDFBETA), which are multiple influential diagnostics. The simulation study is conducted with one, two and five independent variable logistic regression models. The performance of the diagnostic measures is examined for a single contaminated independent variable for each model and in the case where all the independent variables are contaminated with certain contamination rates and intensity. In addition, the performance of the diagnostic measures is compared in terms of the correct identification rate and swamping rate via a frequently referred to data set in the literature.

19.
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and then do subsequent sparse estimation, such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical-correlation-based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster-representatives with the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

20.
Since the seminal paper by Cook (1977), in which he introduced Cook's distance, the identification of influential observations has received a great deal of interest and extensive investigation in linear regression. It is well documented that most of the popular diagnostic measures based on single-case deletion can mislead the analysis in the presence of multiple influential observations because of the well-known masking and/or swamping phenomena. Atkinson (1981) proposed a modification of Cook's distance. In this paper we propose a further modification of Cook's distance for the identification of a single influential observation. We then propose new measures for the identification of multiple influential observations, which are not affected by the masking and swamping problems. The efficiency of the new statistics is demonstrated through several well-known data sets and a simulation study.
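
As background, the classical single-case Cook's distance on which these modifications build can be computed from the residuals and leverages of one fit. The sketch below assumes a simple regression with an intercept and one predictor and uses toy data. It does not implement Atkinson's modification or the measures proposed in the paper, which the abstract does not specify.

```python
def cooks_distance(x, y):
    """Cook's distance for each case in the simple regression y = b0 + b1 * x."""
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # leverages
    # Closed form of the deletion statistic: e^2 * h / (p * s^2 * (1 - h)^2).
    return [e * e * h / (p * s2 * (1.0 - h) ** 2) for e, h in zip(resid, lev)]

# Toy data (assumed): the last case has high leverage and a large residual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 12.0]
d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
```

Each value measures how far all fitted values move when that one case is deleted, scaled by p times the residual variance; it is this one-at-a-time deletion that makes the measure vulnerable to masking and swamping.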
