期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Identification and classification of multiple outliers,high leverage points and influential observations in linear regression

A.A.M. Nurunnabi M. Nasser A.H.M.R. Imon 《Journal of applied statistics》2016,43(3):509-525

Detection of multiple unusual observations such as outliers, high leverage points and influential observations (IOs) in regression is still a challenging task for statisticians due to the well-known masking and swamping effects. In this paper we introduce a robust influence distance that can identify multiple IOs, and propose a sixfold plotting technique based on the well-known group deletion approach to classify regular observations, outliers, high leverage points and IOs simultaneously in linear regression. Experiments through several well-referred data sets and simulation studies demonstrate that the proposed algorithm performs successfully in the presence of multiple unusual observations and can avoid masking and/or swamping effects. 相似文献

2.

Masking and swamping effects on tests for multiple outliers in normal sample

S. M. Bendre 《统计学通讯:理论与方法》2013,42(2):697-710

A study of some commonly used multiple outlier tests in case of normal samples is presented. When the number of outliers in the sample is unknown, two phenomena, namely, the masking and the swamping effect can occur. The performance of the tests is studied using the measures of masking and swamping effects proposed by Bendre and Kale (1985) and Bendre (1985). The effects are illustrated in case of the Murphy test, Tietjen—Moore test and Dixon test. A small simulation study is carried out to indicate these effects. 相似文献

3.

Tests for multiple upper or lower outliers in an exponential sample

Jin Zhang 《Journal of applied statistics》1998,25(2):245-255

SUMMARY T = \[x + ... + x ]/ Sigma x (T*= \[x + ... + x ] Sigma x ) is the max k (n- k+ 1 ) (n) i k ( 1 ) (k) i imum likelihood ratio test statistic for k upper ( lower ) outliers in an exponential sample x , ..., x . The null distributions of T for k= 1,2 were given by Fisher and by Kimber 1 n k and Stevens , while those of T*(k= 1,2) were given by Lewis and Fieller . In this paper , k the simple null distributions of T and T* are found for all possible values of k, and k k percentage points are tabulated for k= 1, 2, ..., 8. In addition , we find a way of determining k, which can reduce the masking or ' swamping ' effects . 相似文献

4.

A Perturbation Approach to Outlier Detection in Two-Way Contingency Tables

Andy H. Lee & John S. Yick 《Australian & New Zealand Journal of Statistics》1999,41(3):305-315

In order to identify outliers in contingency tables, we evaluate the derivatives of the perturbation-formed surface of the Pearson goodness-of-fit statistic. The resulting diagnostics are shown to be less susceptible to masking and swamping problems than residual-based measures. A Monte Carlo study further confirms the effectiveness of the proposed diagnostics. 相似文献

5.

Using a mixture model for multiple imputation in the presence of outliers: the 'Healthy for life' project

Michael R. Elliott Nicolas Stettler 《Journal of the Royal Statistical Society. Series C, Applied statistics》2007,56(1):63-78

Summary. We consider the problem of obtaining population-based inference in the presence of missing data and outliers in the context of estimating the prevalence of obesity and body mass index measures from the 'Healthy for life' study. Identifying multiple outliers in a multivariate setting is problematic because of problems such as masking, in which groups of outliers inflate the covariance matrix in a fashion that prevents their identification when included, and swamping, in which outliers skew covariances in a fashion that makes non-outlying observations appear to be outliers. We develop a latent class model that assumes that each observation belongs to one of K unobserved latent classes, with each latent class having a distinct covariance matrix. We consider the latent class covariance matrix with the largest determinant to form an 'outlier class'. By separating the covariance matrix for the outliers from the covariance matrices for the remainder of the data, we avoid the problems of masking and swamping. As did Ghosh-Dastidar and Schafer, we use a multiple-imputation approach, which allows us simultaneously to conduct inference after removing cases that appear to be outliers and to promulgate uncertainty in the outlier status through the model inference. We extend the work of Ghosh-Dastidar and Schafer by embedding the outlier class in a larger mixture model, consider penalized likelihood and posterior predictive distributions to assess model choice and model fit, and develop the model in a fashion to account for the complex sample design. We also consider the repeated sampling properties of the multiple imputation removal of outliers. 相似文献

6.

A clustering approach to detect multiple outliers in linear functional relationship model for circular data

Nurkhairany Amyra Mokhtar Abdul Ghapor Hussin 《Journal of applied statistics》2018,45(6):1041-1051

Outlier detection has been used extensively in data analysis to detect anomalous observation in data. It has important applications such as in fraud detection and robust analysis, among others. In this paper, we propose a method in detecting multiple outliers in linear functional relationship model for circular variables. Using the residual values of the Caires and Wyatt model, we applied the hierarchical clustering approach. With the use of a tree diagram, we illustrate the detection of outliers graphically. A Monte Carlo simulation study is done to verify the accuracy of the proposed method. Low probability of masking and swamping effects indicate the validity of the proposed approach. Also, the illustrations to two sets of real data are given to show its practical applicability. 相似文献

7.

Identification of Multiple Outliers in Logistic Regression

A. H. M. Rahmatullah Imon Ali S. Hadi 《统计学通讯:理论与方法》2013,42(11):1697-1709

The use of logistic regression modeling has seen a great deal of attention in the literature in recent years. This includes all aspects of the logistic regression model including the identification of outliers. A variety of methods for the identification of outliers, such as the standardized Pearson residuals, are now available in the literature. These methods, however, are successful only if the data contain a single outlier. In the presence of multiple outliers in the data, which is often the case in practice, these methods fail to detect the outliers. This is due to the well-known problems of masking (false negative) and swamping (false positive) effects. In this article, we propose a new method for the identification of multiple outliers in logistic regression. We develop a generalized version of standardized Pearson residuals based on group deletion and then propose a technique for identifying multiple outliers. The performance of the proposed method is then investigated through several examples. 相似文献

8.

Distance-based outlier detection for high dimension,low sample size data

Jeongyoun Ahn Myung Hee Lee Jung Ae Lee 《Journal of applied statistics》2019,46(1):13-29

Despite the popularity of high dimension, low sample size data analysis, there has not been enough attention to the sample integrity issue, in particular, a possibility of outliers in the data. A new outlier detection procedure for data with much larger dimensionality than the sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect rather than a masking effect, thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented with a consideration on the implication of found outliers. 相似文献

9.

Combining Bayesian method and Kalman smoother for detection additive outlier patches in autoregressive time series

Farideh Mohammadinia Rahim Chinipardaz 《统计学通讯:模拟与计算》2013,42(7):2191-2209

ABSTRACT

This article proposes a development of detecting patches of additive outliers in autoregressive time series models. The procedure improves the existing detection methods via Gibbs sampling. We combine the Bayesian method and the Kalman smoother to present some candidate models of outlier patches and the best model with the minimum Bayesian information criterion (BIC) is selected among them. We propose that this combined Bayesian and Kalman method (CBK) can reduce the masking and swamping effects about detecting patches of additive outliers. The correctness of the method is illustrated by simulated data and then by analyzing a real set of observations. 相似文献

10.

The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression

M. Habshah M. R. Norazan A. H.M. Rahmatullah Imon 《Journal of applied statistics》2009,36(5):507-520

Leverage values are being used in regression diagnostics as measures of influential observations in the $X$-space. Detection of high leverage values is crucial because of their responsibility for misleading conclusion about the fitting of a regression model, causing multicollinearity problems, masking and/or swamping of outliers, etc. Much work has been done on the identification of single high leverage points and it is generally believed that the problem of detection of a single high leverage point has been largely resolved. But there is no general agreement among the statisticians about the detection of multiple high leverage points. When a group of high leverage points is present in a data set, mainly because of the masking and/or swamping effects the commonly used diagnostic methods fail to identify them correctly. On the other hand, the robust alternative methods can identify the high leverage points correctly but they have a tendency to identify too many low leverage points to be points of high leverages which is not also desired. An attempt has been made to make a compromise between these two approaches. We propose an adaptive method where the suspected high leverage points are identified by robust methods and then the low leverage points (if any) are put back into the estimation data set after diagnostic checking. The usefulness of our newly proposed method for the detection of multiple high leverage points is studied by some well-known data sets and Monte Carlo simulations. 相似文献

11.

A comparison of two boxplot methods for detecting univariate outliers which adjust for sample size and asymmetry

Nancy J. Carter Neil C. Schwertman Terry L. Kiser 《Statistical Methodology》2009,6(6):604-621

It is important to identify outliers since inclusion, especially when using parametric methods, can cause distortion in the analysis and lead to erroneous conclusions. One of the easiest and most useful methods is based on the boxplot. This method is particularly appealing since it does not use any outliers in computing spread. Two methods, one by Carling and another by Schwertman and de Silva, adjust the boxplot method for sample size and skewness. In this paper, the two procedures are compared both theoretically and by Monte Carlo simulations. Simulations using both a symmetric distribution and an asymmetric distribution were performed on data sets with none, one, and several outliers. Based on the simulations, the Carling approach is superior in avoiding masking outliers, that is, the Carling method is less likely to overlook an outlier while the Schwertman and de Silva procedure is much better at reducing swamping, that is, misclassifying an observation as an outlier. Carling’s method is to the Schwertman and de Silva procedure as comparisonwise versus experimentwise error rate is for multiple comparisons. The two methods, rather than being competitors, appear to complement each other. Used in tandem they provide the data analyst a more complete prospective for identifying possible outliers. 相似文献

12.

Identification of multiple high leverage points in logistic regression

A.H.M. Rahmatullah Imon Ali S. Hadi 《Journal of applied statistics》2013,40(12):2601-2616

Leverage values are being used in regression diagnostics as measures of unusual observations in the X-space. Detection of high leverage observations or points is crucial due to their responsibility for masking outliers. In linear regression, high leverage points (HLP) are those that stand far apart from the center (mean) of the data and hence the most extreme points in the covariate space get the highest leverage. But Hosemer and Lemeshow [Applied logistic regression, Wiley, New York, 1980] pointed out that in logistic regression, the leverage measure contains a component which can make the leverage values of genuine HLP misleadingly very small and that creates problem in the correct identification of the cases. Attempts have been made to identify the HLP based on the median distances from the mean, but since they are designed for the identification of a single high leverage point they may not be very effective in the presence of multiple HLP due to their masking (false–negative) and swamping (false–positive) effects. In this paper we propose a new method for the identification of multiple HLP in logistic regression where the suspect cases are identified by a robust group deletion technique and they are confirmed using diagnostic techniques. The usefulness of the proposed method is then investigated through several well-known examples and a Monte Carlo simulation. 相似文献

13.

Robust Regression Using Data Partitioning and M-Estimation

Yousung Park Seongyong Kim 《统计学通讯:模拟与计算》2013,42(8):1282-1300

We propose a new robust regression estimator using data partition technique and M estimation (DPM). The data partition technique is designed to define a small fixed number of subsets of the partitioned data set and to produce corresponding ordinary least square (OLS) fits in each subset, contrary to the resampling technique of existing robust estimators such as the least trimmed squares estimator. The proposed estimator shares a common strategy with the median ball algorithm estimator that is obtained from the OLS trial fits only on a fixed number of subsets of the data. We examine performance of the DPM estimator in the eleven challenging data sets and simulation studies. We also compare the DPM with the five commonly used robust estimators using empirical convergence rates relative to the OLS for clean data, robustness through mean squared error and bias, masking and swamping probabilities, the ability of detecting the known outliers, and the regression and affine equivariances. 相似文献

14.

Procedures for the identification of multiple influential observations in linear regression

A.A.M. Nurunnabi Ali S. Hadi A.H.M.R. Imon 《Journal of applied statistics》2014,41(6):1315-1331

Since the seminal paper by Cook (1977) in which he introduced Cook's distance, the identification of influential observations has received a great deal of interest and extensive investigation in linear regression. It is well documented that most of the popular diagnostic measures that are based on single-case deletion can mislead the analysis in the presence of multiple influential observations because of the well-known masking and/or swamping phenomena. Atkinson (1981) proposed a modification of Cook's distance. In this paper we propose a further modification of the Cook's distance for the identification of a single influential observation. We then propose new measures for the identification of multiple influential observations, which are not affected by the masking and swamping problems. The efficiency of the new statistics is presented through several well-known data sets and a simulation study. 相似文献

15.

Cluster-based multivariate outlier identification and re-weighted regression in linear models

Ekele Alih Hong Choon Ong 《Journal of applied statistics》2015,42(5):938-955

A cluster methodology, motivated by a robust similarity matrix is proposed for identifying likely multivariate outlier structure and to estimate weighted least-square (WLS) regression parameters in linear models. The proposed method is an agglomeration of procedures that begins from clustering the n-observations through a test of ‘no-outlier hypothesis’ (TONH) to a weighted least-square regression estimation. The cluster phase partition the n-observations into h-set called main cluster and a minor cluster of size n?h. A robust distance emerge from the main cluster upon which a test of no outlier hypothesis’ is conducted. An initial WLS regression estimation is computed from the robust distance obtained from the main cluster. Until convergence, a re-weighted least-squares (RLS) regression estimate is updated with weights based on the normalized residuals. The proposed procedure blends an agglomerative hierarchical cluster analysis of a complete linkage through the TONH to the Re-weighted regression estimation phase. Hence, we propose to call it cluster-based re-weighted regression (CBRR). The CBRR is compared with three existing procedures using two data sets known to exhibit masking and swamping. The performance of CBRR is further examined through simulation experiment. The results obtained from the data set illustration and the Monte Carlo study shows that the CBRR is effective in detecting multivariate outliers where other methods are susceptible to it. The CBRR does not require enormous computation and is substantially not susceptible to masking and swamping. 相似文献

16.

The Identification of Multiple Outliers in ARIMA Models

《统计学通讯:理论与方法》2013,42(6):1265-1287

Abstract

There are three main problems in the existing procedures for detecting outliers in ARIMA models. The first one is the biased estimation of the initial parameter values that may strongly affect the power to detect outliers. The second problem is the confusion between level shifts and innovative outliers when the series has a level shift. The third problem is masking. We propose a procedure that keeps the powerful features of previous methods but improves the initial parameter estimate, avoids the confusion between innovative outliers and level shifts and includes joint tests for sequences of additive outliers in order to solve the masking problem. A Monte Carlo study and one example of the performance of the proposed procedure are presented. 相似文献

17.

Stepwise local influence in generalized autoregressive conditional heteroskedasticity models

Lei Shi Md. Mostafizur Rahman Wen Gan Jianhua Zhao 《Journal of applied statistics》2015,42(2):428-444

Detection of outliers or influential observations is an important work in statistical modeling, especially for the correlated time series data. In this paper we propose a new procedure to detect patch of influential observations in the generalized autoregressive conditional heteroskedasticity (GARCH) model. Firstly we compare the performance of innovative perturbation scheme, additive perturbation scheme and data perturbation scheme in local influence analysis. We find that the innovative perturbation scheme give better result than other two schemes although this perturbation scheme may suffer from masking effects. Then we use the stepwise local influence method under innovative perturbation scheme to detect patch of influential observations and uncover the masking effects. The simulated studies show that the new technique can successfully detect a patch of influential observations or outliers under innovative perturbation scheme. The analysis based on simulation studies and two real data sets show that the stepwise local influence method under innovative perturbation scheme is efficient for detecting multiple influential observations and dealing with masking effects in the GARCH model. 相似文献

18.

The stalactite plot for the detection of multivariate outliers 总被引：1，自引：0，他引：1

A. C. Atkinson H.-M. Mulira 《Statistics and Computing》1993,3(1):27-35

Detection of multiple outliers in multivariate data using Mahalanobis distances requires robust estimates of the means and covariance of the data. We obtain this by sequential construction of an outlier free subset of the data, starting from a small random subset. The stalactite plot provides a cogent summary of suspected outliers as the subset size increases. The dependence on subset size can be virtually removed by a simulation-based normalization. Combined with probability plots and resampling procedures, the stalactite plot, particularly in its normalized form, leads to identification of multivariate outliers, even in the presence of appreciable masking. 相似文献

19.

Identification of multiple influential observations in logistic regression

A. A.M. Nurunnabi A. H.M. Rahmatullah Imon M. Nasser 《Journal of applied statistics》2010,37(10):1605-1624

The identification of influential observations in logistic regression has drawn a great deal of attention in recent years. Most of the available techniques like Cook's distance and difference of fits (DFFITS) are based on single-case deletion. But there is evidence that these techniques suffer from masking and swamping problems and consequently fail to detect multiple influential observations. In this paper, we have developed a new measure for the identification of multiple influential observations in logistic regression based on a generalized version of DFFITS. The advantage of the proposed method is then investigated through several well-referred data sets and a simulation study. 相似文献

20.

基于稳健主成分回归的统计数据可靠性评估方法 总被引：1，自引：0，他引：1

下载免费PDF全文

卢二坡张焕明《统计研究》2011,28(8):21-27

稳健主成分回归（RPCR）是稳健主成分分析和稳健回归分析结合使用的一种方法,本文首次运用稳健的RPCR及异常值诊断方法,对2008年我国地区经济增长横截面数据可靠性做了评估。评估结果表明：稳健的RPCR方法能更好的克服异常值的影响,使估计结果更加可靠,并能有效的克服经典的主成分回归（CPCR）方法容易出现的多个异常点的掩盖现象;基本可以认为2008年地区经济增长与相关指标数据是匹配的,但部分地区的经济增长数据可能存在可靠性问题。相似文献