Found 11 similar articles (search time: 0 ms)
1.
Leverage values are used in regression diagnostics as measures of unusual observations in the X-space. Detecting high leverage observations is crucial because they are responsible for masking outliers. In linear regression, high leverage points (HLP) are those that stand far apart from the center (mean) of the data, and hence the most extreme points in the covariate space get the highest leverage. But Hosmer and Lemeshow [Applied logistic regression, Wiley, New York, 1980] pointed out that in logistic regression the leverage measure contains a component which can make the leverage values of genuine HLP misleadingly small, which creates problems in correctly identifying such cases. Attempts have been made to identify HLP based on the median distances from the mean, but since these methods are designed to identify a single high leverage point, they may not be very effective in the presence of multiple HLP because of masking (false-negative) and swamping (false-positive) effects. In this paper we propose a new method for the identification of multiple HLP in logistic regression, in which suspect cases are identified by a robust group deletion technique and then confirmed using diagnostic techniques. The usefulness of the proposed method is then investigated through several well-known examples and a Monte Carlo simulation.
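As a minimal illustration of the leverage values this abstract refers to (not the method the authors propose), the sketch below computes classical leverages for a simple linear regression with an intercept, where h_i = 1/n + (x_i - mean(x))^2 / S_xx. The function name `leverages`, the toy data, and the 2p/n cutoff are assumptions chosen for illustration only.

```python
def leverages(x):
    """Classical leverage values h_i for simple linear regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # h_i grows with the squared distance of x_i from the mean of the covariate.
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical toy data: the last point lies far from the others in X-space.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
h = leverages(x)

p = 2                    # fitted parameters: intercept and slope
cutoff = 2 * p / len(x)  # a common rule of thumb, assumed here for illustration
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
```

The leverages always sum to the number of fitted parameters, so one extreme point necessarily pulls leverage away from the rest. This single-case measure is the kind that the abstract says can fail when several high leverage points are present.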
2.
Detection of multiple unusual observations such as outliers, high leverage points and influential observations (IOs) in regression is still a challenging task for statisticians due to the well-known masking and swamping effects. In this paper we introduce a robust influence distance that can identify multiple IOs, and propose a sixfold plotting technique based on the well-known group deletion approach to classify regular observations, outliers, high leverage points and IOs simultaneously in linear regression. Experiments through several well-referred data sets and simulation studies demonstrate that the proposed algorithm performs successfully in the presence of multiple unusual observations and can avoid masking and/or swamping effects.
3.
A. H. M. Rahmatullah Imon 《Journal of applied statistics》2005,32(9):929-946
The identification of influential observations has drawn a great deal of attention in regression diagnostics. Most identification techniques are based on single-case deletion, and among them DFFITS has become very popular with statisticians. But this technique, like all other single-case diagnostics, may be ineffective in the presence of multiple influential observations. In this paper we develop a generalized version of DFFITS based on group deletion and then use it to propose a new technique for identifying multiple influential observations. The advantage of the proposed method in identifying multiple influential cases is then investigated through several well-referred data sets.
4.
Since the seminal paper by Cook (1977), in which he introduced Cook's distance, the identification of influential observations has received a great deal of interest and extensive investigation in linear regression. It is well documented that most of the popular diagnostic measures based on single-case deletion can mislead the analysis in the presence of multiple influential observations because of the well-known masking and/or swamping phenomena. Atkinson (1981) proposed a modification of Cook's distance. In this paper we propose a further modification of Cook's distance for the identification of a single influential observation. We then propose new measures for the identification of multiple influential observations, which are not affected by the masking and swamping problems. The efficiency of the new statistics is demonstrated through several well-known data sets and a simulation study.
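For orientation, the sketch below computes the standard single-case Cook's distance for a simple linear regression, D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2). It is not the modified statistic proposed in this paper. The function name `cooks_distances` and the toy data are assumptions made for illustration.

```python
def cooks_distances(x, y):
    """Single-case Cook's distance for simple linear regression with an intercept."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Ordinary least squares fit.
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    s2 = sum(e ** 2 for e in resid) / (n - p)                    # residual variance
    h = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]         # leverages
    return [e ** 2 * hi / (p * s2 * (1.0 - hi) ** 2) for e, hi in zip(resid, h)]


# Hypothetical toy data: the last case is far out in x and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
```

Because each D_i is computed by deleting one case at a time, a cluster of similar influential cases can hide one another, which is the masking problem that motivates group-deletion measures.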
5.
The stalactite plot for the detection of multivariate outliers
Detection of multiple outliers in multivariate data using Mahalanobis distances requires robust estimates of the means and covariance of the data. We obtain these by sequential construction of an outlier-free subset of the data, starting from a small random subset. The stalactite plot provides a cogent summary of suspected outliers as the subset size increases. The dependence on subset size can be virtually removed by a simulation-based normalization. Combined with probability plots and resampling procedures, the stalactite plot, particularly in its normalized form, leads to identification of multivariate outliers, even in the presence of appreciable masking.
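To make the distance itself concrete, the sketch below computes squared Mahalanobis distances for bivariate data using the classical (non-robust) sample mean and covariance. This is deliberately not the sequential robust procedure described in the abstract; the function name `mahalanobis_sq_2d` and the toy points are assumptions for illustration.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances of 2-D points from their sample mean."""
    n = len(points)
    mx = sum(px for px, _ in points) / n
    my = sum(py for _, py in points) / n

    # Sample covariance matrix entries (divisor n - 1).
    sxx = sum((px - mx) ** 2 for px, _ in points) / (n - 1)
    syy = sum((py - my) ** 2 for _, py in points) / (n - 1)
    sxy = sum((px - mx) * (py - my) for px, py in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    out = []
    for px, py in points:
        dx, dy = px - mx, py - my
        # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
        out.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return out


# Hypothetical toy data with one point far from the rest.
points = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0), (2.0, 2.0), (8.0, 9.0)]
d2 = mahalanobis_sq_2d(points)
farthest = max(range(len(d2)), key=d2.__getitem__)
```

With classical estimates the squared distances always sum to 2(n - 1) for bivariate data, so outliers inflate the covariance and shrink their own distances. That limitation is why the abstract calls for robust estimates of the mean and covariance.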
6.
A. A.M. Nurunnabi A. H.M. Rahmatullah Imon M. Nasser 《Journal of applied statistics》2010,37(10):1605-1624
The identification of influential observations in logistic regression has drawn a great deal of attention in recent years. Most of the available techniques like Cook's distance and difference of fits (DFFITS) are based on single-case deletion. But there is evidence that these techniques suffer from masking and swamping problems and consequently fail to detect multiple influential observations. In this paper, we have developed a new measure for the identification of multiple influential observations in logistic regression based on a generalized version of DFFITS. The advantage of the proposed method is then investigated through several well-referred data sets and a simulation study.
7.
Angela Montanari 《Statistical Methods and Applications》1995,4(1):89-100
In this paper the most commonly used diagnostic criteria for the identification of outliers or leverage points in the ordinary regression model are reviewed. Their use in the context of the errors-in-variables (e.v.) linear model is discussed and evidence is given that under the e.v. model assumptions the distinction between outliers and leverage points no longer exists.
8.
A Bayesian approach is considered to detect a change-point in the intercept of simple linear regression. The Jeffreys noninformative prior is employed and compared with the uniform prior in Bayesian analysis. The marginal posterior distributions of the change-point, the amount of shift and the slope are derived. Mean square errors, mean absolute errors and mean biases of some Bayesian estimates are examined by the Monte Carlo method, and some numerical results are also shown.
9.
《Journal of Statistical Computation and Simulation》2012,82(3):357-366
The use of a goodness-of-fit test based on the Anderson–Darling (AD) statistic is discussed, with reference to the composite hypothesis that a sample of observations comes from a generalized Rayleigh distribution whose parameters are unspecified. Monte Carlo simulation studies were performed to calculate the critical values for the AD test. These critical values are then used for testing whether a set of observations follows a generalized Rayleigh distribution when the scale and shape parameters are unspecified and are estimated from the sample. The functional relationship between the critical values of the AD test is also examined for each shape parameter (α), sample size (n) and significance level (γ). A power study is performed with the hypothesized generalized Rayleigh distribution against alternative distributions.
10.
11.
Valentina Mameli Debora Slanzi Irene Poli Darren V.S. Green 《Pharmaceutical statistics》2021,20(4):898-915
One of the main problems confronting the drug discovery research field is identifying small molecules, modulators of protein function, which are likely to be therapeutically useful. Common practices rely on the screening of vast libraries of small molecules (often 1–2 million molecules) in order to identify a molecule, known as a lead molecule, which specifically inhibits or activates the protein function. To search for the lead molecule, we investigate the molecular structure, which generally consists of an extremely large number of fragments. Presence or absence of particular fragments, or groups of fragments, can strongly affect molecular properties. We study the relationship between a molecule's properties and its fragment composition by building a regression model in which predictors, represented by binary variables indicating the presence or absence of fragments, are grouped in subsets and a bi-level penalization term is introduced to handle the high dimensionality of the problem. We evaluate the performance of this model in two simulation studies, comparing different penalization terms and different clustering techniques to derive the best structure of predictor subsets. Both studies are characterized by small data sets relative to the number of predictors under consideration. From the results of these simulation studies, we show that our approach can generate models able to identify key features and provide accurate predictions. The good performance of these models is then demonstrated with real data on the MMP–12 enzyme.