Similar literature
 Found 20 similar documents; search took 296 ms.
1.
We propose two novel diagnostic measures for the detection of influential observations for regression parameters in linear regression. Traditional diagnostic statistics focus on the effect of deletion of data points either on parameter estimates, or on predicted values. A data point is regarded as influential by the new methods if its inclusion determines a significantly different likelihood function for the parameter of interest. The likelihood function in question is asymptotically valid for practically all underlying distributions whose second moments exist.

2.
Count data are very often analyzed under the assumption of a Poisson model [(Agresti, A., 1996. An Introduction to Categorical Data Analysis. Wiley, New York; Generalized Linear Models, second ed. Chapman & Hall, New York)]. However, the derived inference is generally erroneous if the underlying distribution is not Poisson (Biometrika 70, 269–274). A parametric robust regression approach is proposed for the analysis of count data. More specifically, it will be demonstrated that the Poisson regression model could be properly adjusted to become asymptotically valid for inference about regression parameters, even if the Poisson assumption fails. With large samples the novel robust methodology provides legitimate likelihood functions for regression parameters, so long as the true underlying distributions have finite second moments. Adjustments that robustify the Poisson regression will be given, respectively, under log link and identity link functions. Simulation studies will be used to demonstrate the efficacy of the robust Poisson regression model.

3.
An extension of some standard likelihood based procedures to heteroscedastic nonlinear regression models under scale mixtures of skew-normal (SMSN) distributions is developed. This novel class of models provides a useful generalization of the heteroscedastic symmetrical nonlinear regression models (Cysneiros et al., 2010), since the random term distributions cover both symmetric as well as asymmetric and heavy-tailed distributions such as skew-t, skew-slash, skew-contaminated normal, among others. A simple EM-type algorithm for iteratively computing maximum likelihood estimates of the parameters is presented and the observed information matrix is derived analytically. In order to examine the performance of the proposed methods, some simulation studies are presented to show the robustness of this flexible class against outlying and influential observations and that the maximum likelihood estimates based on the EM-type algorithm have good asymptotic properties. Furthermore, local influence measures and the one-step approximations of the estimates in the case-deletion model are obtained. Finally, an illustration of the methodology is given considering a data set previously analyzed under the homoscedastic skew-t nonlinear regression model.

4.
The aim of this paper is to define and develop diagnostic measures with respect to kernel ridge regression in a reproducing kernel Hilbert space (RKHS). To identify influential observations, we define a particular version of Cook's distance for the kernel ridge regression model in RKHS, which is conceptually consistent with Cook's distance in a classical regression model. Then, by using the perturbation formula for the regularized conditional expectation of the outcome in RKHS, we develop an approximate version of Cook's distance in RKHS because the original definition requires intensive computations. Such an approximated Cook's distance is represented in terms of basic building blocks such as residuals and leverages of the kernel ridge regression. The results of the simulation and real application demonstrate that our diagnostic measure successfully detects potentially influential observations on estimators in kernel ridge regression.

5.
The analysis of traffic accident data is crucial to address numerous concerns, such as understanding contributing factors in an accident's chain-of-events, identifying hotspots, and informing policy decisions about road safety management. The majority of statistical models employed for analyzing traffic accident data are naturally count regression models (commonly Poisson regression), since a count, like the number of accidents, is used as the response. However, features of the observed data frequently do not make the Poisson distribution a tenable assumption. For example, observed data rarely demonstrate an equal mean and variance and often contain excess zeros. Sometimes, data may have a heterogeneous structure consisting of a mixture of populations, rather than a single population. In such data analyses, mixtures-of-Poisson-regression models can be used. In this study, the number of injuries resulting from casualties of traffic accidents registered by the General Directorate of Security (Turkey, 2005–2014) is modeled using a novel mixture distribution with two components: a Poisson and a zero-truncated Poisson distribution. Such a model differs from existing mixture models in the literature, where the components are either all Poisson distributions or all zero-truncated Poisson distributions. The proposed model is compared with the Poisson regression model via simulation and in the analysis of the traffic data.
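
To make the two-component structure concrete, here is a minimal sketch of the probability mass function of a mixture of one Poisson and one zero-truncated Poisson component. The function names, the mixing weight `weight`, and the rates `lam_pois` and `lam_ztp` are illustrative assumptions; the abstract does not give the authors' parameterization, and no regression covariates are included here.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability P(Y = y) for rate lam > 0, computed in log space."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def zero_truncated_poisson_pmf(y, lam):
    """Poisson pmf conditioned on Y >= 1, so it puts no mass on y = 0."""
    if y == 0:
        return 0.0
    return poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


def mixture_pmf(y, weight, lam_pois, lam_ztp):
    """weight * Poisson(lam_pois) + (1 - weight) * zero-truncated Poisson(lam_ztp)."""
    return (weight * poisson_pmf(y, lam_pois)
            + (1.0 - weight) * zero_truncated_poisson_pmf(y, lam_ztp))


# Hypothetical parameter values, chosen only for illustration.
probs = [mixture_pmf(y, weight=0.4, lam_pois=0.5, lam_ztp=3.0) for y in range(60)]
```

In this form only the Poisson component can produce a zero, so the probability of a zero count equals `weight * exp(-lam_pois)`.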

6.
In practice, the presence of influential observations may lead to misleading results in variable screening problems. We, therefore, propose a robust variable screening procedure for high-dimensional data analysis in this paper. Our method consists of two steps. The first step is to define a new high-dimensional influence measure and propose a novel influence diagnostic procedure to remove those unusual observations. The second step is to utilize the sure independence screening procedure based on distance correlation to select important variables in high-dimensional regression analysis. The new influence measure and diagnostic procedure that we developed are model free. To confirm the effectiveness of the proposed method, we conduct simulation studies and a real-life data analysis to illustrate the merits of the proposed approach over some competing methods. Both the simulation results and the real-life data analysis demonstrate that the proposed method can greatly control the adverse effect after detecting and removing those unusual observations, and performs better than the competing methods.
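
The second step, screening by distance correlation, can be sketched as follows. This covers only the generic sample distance correlation for univariate variables and a simple ranking rule; the paper's influence measure is not described in the abstract and is not reproduced. The helper names `distance_correlation` and `screen` are hypothetical.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    """Average of the elementwise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric lists."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(_mean_product(a, b), 0.0) / denom)


def screen(columns, response, keep):
    """Indices of the `keep` predictors with the largest distance correlation."""
    scores = [distance_correlation(col, response) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order[:keep]


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]          # depends on x, yet has zero Pearson correlation
constant = [1.0] * len(x)       # uninformative predictor
selected = screen([constant, x], y, keep=1)
```

Distance correlation picks up the quadratic dependence between `x` and `y` that a linear correlation would miss, which is one reason it suits model-free screening.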

7.
A large number of statistics are used in the literature to detect outliers and influential observations in the linear regression model. In this paper comparison studies have been made to determine which statistic performs better than the others. This includes: (i) a detailed simulation study, and (ii) analyses of several data sets studied by different authors. Different choices of the design matrix of the regression model are considered. Design A studies the performance of the various statistics for detecting scale shift type outliers, and designs B and C provide information on the performance of the statistics for identifying influential observations. We have used cutoff points based on the exact distributions and Bonferroni's inequality for each statistic. The results show that the studentized residual, which is used for detection of mean shift outliers, is also appropriate for detection of scale shift outliers, and that Welsch's statistic and Cook's distance are appropriate for detection of influential observations.
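
For reference, two of the statistics named above, the (internally) studentized residual and Cook's distance, can be computed as in the sketch below for a simple linear regression with an intercept. The function name and the toy data are assumptions made for illustration; the cutoff points and designs A to C from the study are not reproduced.

```python
import math


def simple_ols_diagnostics(x, y):
    """Leverages, internally studentized residuals and Cook's distances
    for the least-squares fit y = intercept + slope * x."""
    n = len(x)
    p = 2  # number of fitted coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Diagonal of the hat matrix for a one-predictor model.
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate

    studentized = [e / math.sqrt(s2 * (1.0 - h))
                   for e, h in zip(residuals, leverages)]
    cooks = [r * r * h / (p * (1.0 - h))
             for r, h in zip(studentized, leverages)]
    return leverages, studentized, cooks


# Toy data: the last point sits far from the other x values and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
leverages, studentized, cooks = simple_ols_diagnostics(x, y)
most_influential = cooks.index(max(cooks))
```

Cook's distance combines the size of the residual with the leverage, so a point with a moderate residual but extreme leverage, like the last one here, can dominate the fit.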

8.
Several methods have been suggested to detect influential observations in the linear regression model and a number of them have been extended for the multivariate regression model. In this article we consider the multivariate general linear model, Y = XB + k, which contains the linear regression model and the multivariate regression model as particular cases. Assuming that the random disturbances are normally distributed, the BLUE of v B is also normally distributed. Since the distribution of the BLUE of v B and the distribution of the BLUE of v B in the model with the omission of a set of observations differ, to study the influence that a set of observations has on the BLUE of v B, we propose to measure the distance between both distributions. To do this we use Rao distance.

9.
Count data analysis techniques have been developed in biological and medical research areas. In particular, zero-inflated versions of parametric count distributions have been used to model excessive zeros that are often present in these assays. The most common count distributions for analyzing such data are Poisson and negative binomial. However, a Poisson distribution can only handle equidispersed data and a negative binomial distribution can only cope with overdispersion. A Conway–Maxwell–Poisson (CMP) distribution [4], by contrast, can handle a wide range of dispersion. We show, with an illustrative data set on next-generation sequencing of maize hybrids, that both underdispersion and overdispersion can be present in genomic data. Furthermore, the maize data set consists of clustered observations and, therefore, we develop inference procedures for a zero-inflated CMP regression that incorporates a cluster-specific random effect term. Unlike the Gaussian models, the underlying likelihood is computationally challenging. We use a numerical approximation via a Gaussian quadrature to circumvent this issue. A test for checking zero-inflation has also been developed in our setting. Finite sample properties of our estimators and test have been investigated by extensive simulations. Finally, the statistical methodology has been applied to analyze the maize data mentioned before.
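
The dispersion behaviour of the CMP distribution can be checked numerically. The sketch below evaluates the standard CMP mass function, proportional to lam**y / (y!)**nu, by truncating the normalizing series at an assumed cutoff `max_y`; it leaves out the zero-inflation, the random effect and the Gaussian quadrature used in the paper. The function names are hypothetical.

```python
import math


def cmp_pmf(lam, nu, max_y=200):
    """Truncated CMP probabilities [P(Y=0), ..., P(Y=max_y)].

    Terms are computed on the log scale and shifted by their maximum
    before exponentiating, which avoids overflow for large counts.
    """
    log_terms = [y * math.log(lam) - nu * math.lgamma(y + 1)
                 for y in range(max_y + 1)]
    peak = max(log_terms)
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


def mean_and_variance(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ..."""
    mean = sum(y * p for y, p in enumerate(pmf))
    variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, variance


# nu = 1 recovers the Poisson; nu < 1 and nu > 1 move away from equidispersion.
mean_eq, var_eq = mean_and_variance(cmp_pmf(lam=3.0, nu=1.0))
mean_over, var_over = mean_and_variance(cmp_pmf(lam=3.0, nu=0.5))
mean_under, var_under = mean_and_variance(cmp_pmf(lam=3.0, nu=2.0))
```

With `nu = 1` the mean and variance coincide, as for a Poisson; `nu < 1` gives a variance above the mean and `nu > 1` a variance below it.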

10.
In this article, parametric robust regression approaches are proposed for making inferences about regression parameters in the setting of generalized linear models (GLMs). The proposed methods are able to test hypotheses on the regression coefficients in misspecified GLMs. More specifically, it is demonstrated that with large samples, the normal and gamma regression models can be properly adjusted to become asymptotically valid for inferences about regression parameters under model misspecification. These adjusted regression models can provide the correct type I and II error probabilities and the correct coverage probability for continuous data, as long as the true underlying distributions have finite second moments.

11.
The added variable plot is commonly used in assessing the accuracy of a normal linear model. This plot is often used to evaluate the effect of adding an explanatory variable into the model and to detect possibly high leverage points or influential observations on the added variable. However, this type of plot is generally unreliable once the normal distributional assumptions are violated. In this article, we extend the robust likelihood technique introduced by Royall and Tsou [11] to propose a robust added variable plot. The validity of this diagnostic plot requires no knowledge of the true underlying distributions so long as their second moments exist. The usefulness of the robust graphical approach is demonstrated through a few illustrations and simulations.

12.
The Poisson GWMA (PGWMA) control chart is an extension of the Poisson EWMA chart. It is substantially sensitive to small process shifts when monitoring Poisson observations. Recently, some approaches have been proposed to modify EWMA charts with fast initial response (FIR) features. In this article, we employ these approaches in PGWMA charts and introduce a novel chart called the Poisson double GWMA (PDGWMA) chart for comparison. Using simulation, various control schemes are designed and their average run lengths (ARLs) are computed and compared. It is shown that the PDGWMA chart is the first choice in detecting small shifts, especially when the shifts are downward, and the PGWMA chart with adjusted time-varying control limits performs excellently in detecting large process shifts during the initial stage.
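
As background for the charts being compared, the sketch below implements the plain Poisson EWMA chart with the usual time-varying control limits, which is the base case that the GWMA charts extend. It is not the PGWMA or PDGWMA chart and includes no FIR adjustment. The parameter names `smoothing` and `width`, their default values, and the toy series are assumptions.

```python
import math


def poisson_ewma_chart(counts, mu0, smoothing=0.2, width=3.0):
    """Return (statistic, lower, upper, signal) for each observed count.

    The statistic starts at the in-control mean mu0 and is updated as
    z_t = smoothing * x_t + (1 - smoothing) * z_{t-1}. For in-control
    Poisson counts the variance of z_t is
    mu0 * smoothing / (2 - smoothing) * (1 - (1 - smoothing) ** (2 * t)).
    """
    z = mu0
    chart = []
    for t, x in enumerate(counts, start=1):
        z = smoothing * x + (1.0 - smoothing) * z
        var_factor = (smoothing / (2.0 - smoothing)
                      * (1.0 - (1.0 - smoothing) ** (2 * t)))
        half_width = width * math.sqrt(var_factor * mu0)
        lower = max(0.0, mu0 - half_width)  # counts cannot fall below zero
        upper = mu0 + half_width
        chart.append((z, lower, upper, z < lower or z > upper))
    return chart


# Toy series: five in-control counts, then a sustained upward shift.
counts = [4] * 5 + [9] * 10
chart = poisson_ewma_chart(counts, mu0=4.0)
signals = [i for i, row in enumerate(chart) if row[3]]
```

The limits are narrower for early observations and widen toward their asymptotic value, which is what makes time-varying limits more responsive to shifts that occur soon after start-up.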

13.
Normality and independence of error terms are typical assumptions for partial linear models. However, these assumptions may be unrealistic in many fields, such as economics, finance and biostatistics. In this paper, a Bayesian analysis for a partial linear model with first-order autoregressive errors belonging to the class of the scale mixtures of normal distributions is studied in detail. The proposed model provides a useful generalization of the symmetrical linear regression model with independent errors, since the distribution of the error term covers both correlated and thick-tailed distributions, and has a convenient hierarchical representation allowing easy implementation of a Markov chain Monte Carlo scheme. In order to examine the robustness of the model against outlying and influential observations, Bayesian case-deletion influence diagnostics based on the Kullback–Leibler (K–L) divergence are presented. The proposed method is applied to monthly and daily returns of two Chilean companies.

14.
Logistic regression is frequently used for classifying observations into two groups. Unfortunately there are often outlying observations in a data set and these might affect the estimated model and the associated classification error rate. In this paper, the authors study the effect of observations in the training sample on the error rate by deriving influence functions. They obtain a general expression for the influence function of the error rate, and they compute it for the maximum likelihood estimator as well as for several robust logistic discrimination procedures. Besides being of interest in their own right, the influence functions are also used to derive asymptotic classification efficiencies of different logistic discrimination rules. The authors also show how influential points can be detected by means of a diagnostic plot based on the values of the influence function.

15.
Data‐analytic tools for models other than the normal linear regression model are relatively rare. Here we develop plots and diagnostic statistics for nonconstant variance for the random‐effects model (REM). REMs for longitudinal data include both within‐ and between‐subject variances. A basic assumption is that the two variance terms are constant across subjects. However, we often find that these variances are functions of covariates, and the data set has what we call explainable heterogeneity, which needs to be allowed for in the model. We characterize several types of heterogeneity of variance in REMs and develop three diagnostic tests using the score statistic: one for each of the two variance terms, and the third for a form of multivariate nonconstant variance. For each test we present an adjusted residual plot which can identify cases that are unusually influential on the outcome of the test.

16.
The detection of outliers and influential observations has received a great deal of attention in the statistical literature in the context of least-squares (LS) regression. However, the explanatory variables can be correlated with each other, and alternatives to LS have emerged to address outliers/influential observations and multicollinearity simultaneously. This paper proposes new influence measures based on the affine combination type regression for the detection of influential observations in the linear regression model when multicollinearity exists. Approximate influence measures are also proposed for the affine combination type regression. Since the affine combination type regression includes the ridge, the Liu and the shrunken regressions as special cases, influence measures under the ridge, the Liu and the shrunken regressions are also examined to see the possible effect that multicollinearity can have on the influence of an observation. The Longley data set is used to illustrate the influence measures in affine combination type regression and also in ridge, Liu and shrunken regressions, so that the performance of different biased regressions in detecting and assessing influential observations is examined.

17.
In fitting a regression model, one or more observations may have substantial effects on estimators. These unusual observations are precisely detected by a new diagnostic measure, Pena's statistic. In this article, we introduce a type of Pena's statistic for each point in Liu regression. Using the forecast change property, we simplify Pena's statistic in a numerical sense. It is found that the simplified Pena's statistic behaves quite well as far as detection of influential observations is concerned. We express Pena's statistic in terms of the Liu leverages and residuals. The normality of this statistic is also discussed and it is demonstrated that it can identify a subset of high Liu leverage outliers. For numerical evaluation, simulation studies are given and a real data set has been analysed for illustration.

18.
This article deals with the general form of the hat matrix and the DFBETA measure to detect the influential observations and the leverages in the linear regression model with more than one regressor when the errors are from AR(1) and AR(2) processes. Previous studies dealing with the influential observations and the leverages in the constant mean model and regression through the origin model are obtained as special cases. To demonstrate the utility of the hat matrix and the DFBETA measure, two numerical examples based on the ice cream consumption data with AR(1) errors and the Fox-Hartnagel data with AR(2) errors are analyzed. The results show that the parameter of the autoregressive process affects the influential and leverage points.

19.
In this paper, two new multiple influential observation detection methods, GCD.GSPR and mCD*, are introduced for logistic regression. The proposed diagnostic measures are compared with the generalized difference in fits (GDFFITS) and the generalized squared difference in beta (GSDFBETA), which are multiple influential diagnostics. The simulation study is conducted with one, two and five independent variable logistic regression models. The performance of the diagnostic measures is examined for a single contaminated independent variable for each model and in the case where all the independent variables are contaminated with certain contamination rates and intensity. In addition, the performance of the diagnostic measures is compared in terms of the correct identification rate and swamping rate via a frequently referred to data set in the literature.

20.
In multivariate regression, a graphical diagnostic method of detecting observations that are influential in estimating regression coefficients is introduced. It is based on the principal components and their variances obtained from the covariance matrix of the probability distribution for the change in the estimator of the matrix of unknown regression coefficients due to a single-case deletion. As a result, each deletion statistic obtained in matrix form is transformed into a two-dimensional quantity. Its univariate version is also introduced in a slightly different way. No distributional form is assumed. For illustration, we provide a numerical example in which the graphical method introduced here is seen to be effective in getting information about influential observations.
