1.

Unmasking test for multiple upper or lower outliers in normal samples

Jin Zhang Xueren Wang 《Journal of applied statistics》1998,25(2):257-261

SUMMARY The discordancy test for multiple outliers is complicated by problems of masking and swamping. The key to the settlement of the question lies in the determination of k, i.e. the number of 'contaminants' in a sample. Great efforts have been made to solve this problem in recent years, but no effective method has been developed. In this paper, we present two ways of determining k, free from the effects of masking and swamping, when testing upper (lower) outliers in normal samples. Examples are given to illustrate the methods.

2.

Identification and classification of multiple outliers, high leverage points and influential observations in linear regression

A.A.M. Nurunnabi M. Nasser A.H.M.R. Imon 《Journal of applied statistics》2016,43(3):509-525

Detection of multiple unusual observations such as outliers, high leverage points and influential observations (IOs) in regression is still a challenging task for statisticians due to the well-known masking and swamping effects. In this paper we introduce a robust influence distance that can identify multiple IOs, and propose a sixfold plotting technique based on the well-known group deletion approach to classify regular observations, outliers, high leverage points and IOs simultaneously in linear regression. Experiments through several well-referred data sets and simulation studies demonstrate that the proposed algorithm performs successfully in the presence of multiple unusual observations and can avoid masking and/or swamping effects.

3.

A Perturbation Approach to Outlier Detection in Two-Way Contingency Tables

Andy H. Lee & John S. Yick 《Australian & New Zealand Journal of Statistics》1999,41(3):305-315

In order to identify outliers in contingency tables, we evaluate the derivatives of the perturbation-formed surface of the Pearson goodness-of-fit statistic. The resulting diagnostics are shown to be less susceptible to masking and swamping problems than residual-based measures. A Monte Carlo study further confirms the effectiveness of the proposed diagnostics.

4.

Procedures for the identification of multiple influential observations in linear regression

A.A.M. Nurunnabi Ali S. Hadi A.H.M.R. Imon 《Journal of applied statistics》2014,41(6):1315-1331

Since the seminal paper by Cook (1977) in which he introduced Cook's distance, the identification of influential observations has received a great deal of interest and extensive investigation in linear regression. It is well documented that most of the popular diagnostic measures that are based on single-case deletion can mislead the analysis in the presence of multiple influential observations because of the well-known masking and/or swamping phenomena. Atkinson (1981) proposed a modification of Cook's distance. In this paper we propose a further modification of the Cook's distance for the identification of a single influential observation. We then propose new measures for the identification of multiple influential observations, which are not affected by the masking and swamping problems. The efficiency of the new statistics is presented through several well-known data sets and a simulation study.
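As a rough illustration of the single-case deletion measure that this abstract builds on, the sketch below computes the classical Cook's distance for a simple linear regression with one predictor. It is not the modified measure proposed by the authors; the function name `cooks_distances` and the toy data are assumptions made purely for illustration.

```python
def cooks_distances(x, y):
    """Classical Cook's distance for y = a + b*x fitted by least squares."""
    n = len(x)
    p = 2  # number of fitted parameters (intercept and slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance estimate
    # Leverage (hat-matrix diagonal) for simple linear regression.
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical toy data: the last point lies far from the trend of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
d = cooks_distances(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
print(most_influential, d)
```

Each distance is computed as if only that one case were suspect, which is why measures of this kind can be misled when several influential observations occur together.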

5.

Cluster-based multivariate outlier identification and re-weighted regression in linear models

Ekele Alih Hong Choon Ong 《Journal of applied statistics》2015,42(5):938-955

A cluster methodology, motivated by a robust similarity matrix, is proposed for identifying likely multivariate outlier structure and for estimating weighted least-squares (WLS) regression parameters in linear models. The proposed method is an agglomeration of procedures that begins with clustering the n observations, proceeds through a test of the ‘no-outlier hypothesis’ (TONH), and ends with a weighted least-squares regression estimation. The cluster phase partitions the n observations into an h-set, called the main cluster, and a minor cluster of size n − h. A robust distance emerges from the main cluster, upon which the test of the ‘no-outlier hypothesis’ is conducted. An initial WLS regression estimate is computed from the robust distance obtained from the main cluster. Until convergence, a re-weighted least-squares (RLS) regression estimate is updated with weights based on the normalized residuals. The proposed procedure blends an agglomerative hierarchical cluster analysis with complete linkage, through the TONH, into the re-weighted regression estimation phase. Hence, we propose to call it cluster-based re-weighted regression (CBRR). The CBRR is compared with three existing procedures using two data sets known to exhibit masking and swamping. The performance of CBRR is further examined through a simulation experiment. The results obtained from the data set illustrations and the Monte Carlo study show that the CBRR is effective in detecting multivariate outliers where other methods are susceptible to masking and swamping. The CBRR does not require enormous computation and is largely unaffected by masking and swamping.

6.

Identification of Multiple Outliers in Logistic Regression

A. H. M. Rahmatullah Imon Ali S. Hadi 《Communications in Statistics - Theory and Methods》2013,42(11):1697-1709

The use of logistic regression modeling has seen a great deal of attention in the literature in recent years. This includes all aspects of the logistic regression model including the identification of outliers. A variety of methods for the identification of outliers, such as the standardized Pearson residuals, are now available in the literature. These methods, however, are successful only if the data contain a single outlier. In the presence of multiple outliers in the data, which is often the case in practice, these methods fail to detect the outliers. This is due to the well-known problems of masking (false negative) and swamping (false positive) effects. In this article, we propose a new method for the identification of multiple outliers in logistic regression. We develop a generalized version of standardized Pearson residuals based on group deletion and then propose a technique for identifying multiple outliers. The performance of the proposed method is then investigated through several examples.

7.

A clustering approach to detect multiple outliers in linear functional relationship model for circular data

Nurkhairany Amyra Mokhtar Abdul Ghapor Hussin 《Journal of applied statistics》2018,45(6):1041-1051

Outlier detection has been used extensively in data analysis to detect anomalous observations in data. It has important applications such as in fraud detection and robust analysis, among others. In this paper, we propose a method for detecting multiple outliers in the linear functional relationship model for circular variables. Using the residual values of the Caires and Wyatt model, we applied the hierarchical clustering approach. With the use of a tree diagram, we illustrate the detection of outliers graphically. A Monte Carlo simulation study is done to verify the accuracy of the proposed method. Low probabilities of masking and swamping effects indicate the validity of the proposed approach. Also, illustrations on two sets of real data are given to show its practical applicability.

8.

A comparison of some lack of fit tests based on near replicates

James W. Neill Dallas E. Johnson 《Communications in Statistics - Theory and Methods》2013,42(10):3533-3570

Several tests for regression lack of fit proposed by Christensen (1989), Shillington (1979) and Neill and Johnson (1985) are compared. The tests considered are applicable for the case of nonreplication and reduce to the classical lack of fit test when independent replications are available. A simulation study is used to compare the size and power of the test procedures for small sample sizes and various configurations of nonreplication. In addition, each test is shown to be consistent as well as invariant with respect to location and scale changes made on the regressor variables.

9.

The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression

M. Habshah M. R. Norazan A. H.M. Rahmatullah Imon 《Journal of applied statistics》2009,36(5):507-520

Leverage values are being used in regression diagnostics as measures of influential observations in the $X$-space. Detection of high leverage values is crucial because of their responsibility for misleading conclusions about the fitting of a regression model, causing multicollinearity problems, masking and/or swamping of outliers, etc. Much work has been done on the identification of single high leverage points and it is generally believed that the problem of detection of a single high leverage point has been largely resolved. But there is no general agreement among statisticians about the detection of multiple high leverage points. When a group of high leverage points is present in a data set, mainly because of the masking and/or swamping effects the commonly used diagnostic methods fail to identify them correctly. On the other hand, the robust alternative methods can identify the high leverage points correctly, but they have a tendency to identify too many low leverage points as points of high leverage, which is also undesirable. An attempt has been made to make a compromise between these two approaches. We propose an adaptive method where the suspected high leverage points are identified by robust methods and then the low leverage points (if any) are put back into the estimation data set after diagnostic checking. The usefulness of our newly proposed method for the detection of multiple high leverage points is studied by some well-known data sets and Monte Carlo simulations.

10.

Tests for multiple upper or lower outliers in an exponential sample

Jin Zhang 《Journal of applied statistics》1998,25(2):245-255

SUMMARY $T_k = [x_{(n-k+1)} + \cdots + x_{(n)}] / \sum x_i$ ($T_k^* = [x_{(1)} + \cdots + x_{(k)}] / \sum x_i$) is the maximum likelihood ratio test statistic for k upper (lower) outliers in an exponential sample $x_1, \ldots, x_n$. The null distributions of $T_k$ for k = 1, 2 were given by Fisher and by Kimber and Stevens, while those of $T_k^*$ (k = 1, 2) were given by Lewis and Fieller. In this paper, the simple null distributions of $T_k$ and $T_k^*$ are found for all possible values of k, and percentage points are tabulated for k = 1, 2, ..., 8. In addition, we find a way of determining k, which can reduce the masking or 'swamping' effects.
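The two statistics in this abstract are ratios of a partial sum of order statistics to the sample total: the k largest observations for upper outliers, the k smallest for lower outliers. A minimal sketch of that arithmetic follows. The function names and the sample values are hypothetical, and the percentage points tabulated in the paper are not reproduced here, so the code only computes the statistics and does not carry out the test.

```python
def upper_outlier_statistic(sample, k):
    """Sum of the k largest observations divided by the sample total."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(sample)
    return sum(ordered[-k:]) / sum(ordered)


def lower_outlier_statistic(sample, k):
    """Sum of the k smallest observations divided by the sample total."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(sample)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical positive data of the kind an exponential model might describe.
sample = [0.2, 0.5, 0.9, 1.4, 2.0, 9.0, 11.0]
t_upper = upper_outlier_statistic(sample, k=2)  # (9.0 + 11.0) / total
t_lower = lower_outlier_statistic(sample, k=2)  # (0.2 + 0.5) / total
print(t_upper, t_lower)
```

A value of the upper statistic close to 1 means that the k largest observations account for most of the sample total. Whether that is significant depends on the null distribution derived in the paper.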

11.

Combining Bayesian method and Kalman smoother for detection additive outlier patches in autoregressive time series

Farideh Mohammadinia Rahim Chinipardaz 《Communications in Statistics - Simulation and Computation》2013,42(7):2191-2209

ABSTRACT

This article proposes a development in detecting patches of additive outliers in autoregressive time series models. The procedure improves the existing detection methods via Gibbs sampling. We combine the Bayesian method and the Kalman smoother to present some candidate models of outlier patches, and the best model, with the minimum Bayesian information criterion (BIC), is selected among them. We propose that this combined Bayesian and Kalman method (CBK) can reduce the masking and swamping effects when detecting patches of additive outliers. The correctness of the method is illustrated by simulated data and then by analyzing a real set of observations.

12.

Identification of multiple influential observations in logistic regression

A. A.M. Nurunnabi A. H.M. Rahmatullah Imon M. Nasser 《Journal of applied statistics》2010,37(10):1605-1624

The identification of influential observations in logistic regression has drawn a great deal of attention in recent years. Most of the available techniques like Cook's distance and difference of fits (DFFITS) are based on single-case deletion. But there is evidence that these techniques suffer from masking and swamping problems and consequently fail to detect multiple influential observations. In this paper, we have developed a new measure for the identification of multiple influential observations in logistic regression based on a generalized version of DFFITS. The advantage of the proposed method is then investigated through several well-referred data sets and a simulation study.

13.

Distance-based outlier detection for high dimension, low sample size data

Jeongyoun Ahn Myung Hee Lee Jung Ae Lee 《Journal of applied statistics》2019,46(1):13-29

Despite the popularity of high dimension, low sample size data analysis, there has not been enough attention to the sample integrity issue, in particular, a possibility of outliers in the data. A new outlier detection procedure for data with much larger dimensionality than the sample size is presented. The proposed method is motivated by asymptotic properties of high-dimensional distance measures. Empirical studies suggest that high-dimensional outlier detection is more likely to suffer from a swamping effect than from a masking effect, and thus yields more false positives than false negatives. We compare the proposed approaches with existing methods using simulated data from various population settings. A real data example is presented with a consideration of the implications of the outliers found.

14.

Using a mixture model for multiple imputation in the presence of outliers: the 'Healthy for life' project

Michael R. Elliott Nicolas Stettler 《Journal of the Royal Statistical Society. Series C, Applied statistics》2007,56(1):63-78

Summary. We consider the problem of obtaining population-based inference in the presence of missing data and outliers in the context of estimating the prevalence of obesity and body mass index measures from the 'Healthy for life' study. Identifying multiple outliers in a multivariate setting is problematic because of problems such as masking, in which groups of outliers inflate the covariance matrix in a fashion that prevents their identification when included, and swamping, in which outliers skew covariances in a fashion that makes non-outlying observations appear to be outliers. We develop a latent class model that assumes that each observation belongs to one of K unobserved latent classes, with each latent class having a distinct covariance matrix. We consider the latent class covariance matrix with the largest determinant to form an 'outlier class'. By separating the covariance matrix for the outliers from the covariance matrices for the remainder of the data, we avoid the problems of masking and swamping. As did Ghosh-Dastidar and Schafer, we use a multiple-imputation approach, which allows us simultaneously to conduct inference after removing cases that appear to be outliers and to promulgate uncertainty in the outlier status through the model inference. We extend the work of Ghosh-Dastidar and Schafer by embedding the outlier class in a larger mixture model, consider penalized likelihood and posterior predictive distributions to assess model choice and model fit, and develop the model in a fashion to account for the complex sample design. We also consider the repeated sampling properties of the multiple imputation removal of outliers.
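The masking mechanism described in this abstract, where a group of outliers inflates the spread estimate and so hides itself, can be seen even in one dimension. The sketch below is a univariate analogy only, not the latent class model of the paper. The z-score rule, the threshold of 2.5 and the data are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_by_z_score(values, threshold=2.5):
    """Return the indices whose absolute z-score exceeds the threshold."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation, inflated by outliers
    return [
        i for i, v in enumerate(values)
        if abs(v - centre) / spread > threshold
    ]


# Hypothetical data: eight well-behaved values plus one or two gross outliers.
clean = [-0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.3]
one_outlier = clean + [10.0]
two_outliers = clean + [10.0, 10.5]

flagged_one = flag_by_z_score(one_outlier)
flagged_two = flag_by_z_score(two_outliers)
print(flagged_one, flagged_two)
```

With this toy data the lone outlier is flagged, but once a second outlier is added neither exceeds the threshold, because both are included in the mean and standard deviation used to judge them.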

15.

Lagrange Multiplier Tests for Normality Against Seminonparametric Alternatives

Alastair Hall 《Journal of Business & Economic Statistics》2013,31(4):417-426

In this article, I derive the Lagrange multiplier test of the null hypothesis that a stationary random vector has a (possibly heteroscedastic) normal distribution against the alternative that the distribution is a member of the family with seminonparametric probability density functions considered by Gallant and Tauchen (1989). The test is shown to contain special cases of the moment tests proposed by Newey (1985) and Tauchen (1985). Evidence from a small simulation study is reported, showing that the test has reasonable finite-sample properties in moderately sized samples. The test is applied to the change of price in a treasury-bill data series analyzed by Tauchen and Pitts (1983) and Tauchen (1985).

16.

A note on connectedness in fixed effects MANOVA and GMANOVA models with missing cells

Leigh W. Murray 《Communications in Statistics - Theory and Methods》2013,42(7):2527-2531

Murray and Smith (1985) and Hocking (1985) give a generalized definition and test of connectedness in the case of missing cells using the univariate cell-means model with linear restrictions on the cell-means. The test of connectedness is here extended to multivariate fixed effects models, including the usual MANOVA model with linear restrictions, the MANOVA model with double linear restrictions, and the GMANOVA model.

17.

Identification of multiple high leverage points in logistic regression

A.H.M. Rahmatullah Imon Ali S. Hadi 《Journal of applied statistics》2013,40(12):2601-2616

Leverage values are being used in regression diagnostics as measures of unusual observations in the X-space. Detection of high leverage observations or points is crucial due to their responsibility for masking outliers. In linear regression, high leverage points (HLP) are those that stand far apart from the center (mean) of the data and hence the most extreme points in the covariate space get the highest leverage. But Hosmer and Lemeshow [Applied logistic regression, Wiley, New York, 1980] pointed out that in logistic regression, the leverage measure contains a component which can make the leverage values of genuine HLP misleadingly very small, and that creates problems in the correct identification of the cases. Attempts have been made to identify the HLP based on the median distances from the mean, but since they are designed for the identification of a single high leverage point they may not be very effective in the presence of multiple HLP due to their masking (false-negative) and swamping (false-positive) effects. In this paper we propose a new method for the identification of multiple HLP in logistic regression where the suspect cases are identified by a robust group deletion technique and they are confirmed using diagnostic techniques. The usefulness of the proposed method is then investigated through several well-known examples and a Monte Carlo simulation.

18.

Locally best invariant test for outliers in a gamma type distribution

Nariaki Sugiura Hiromi Sasamoto 《Communications in Statistics - Simulation and Computation》2013,42(2):415-427

It is shown that the locally best invariant test for the existence of outliers for scale parameters of the gamma distribution is given by Bartholomew's test for exponentiality which is the ratio of the sum of squares of the data to the square of the sample mean. The optimality robustness, including null and nonnull robustness of the test is shown. A small simulation study to compare the power among the other eight competitive tests for testing exponentiality is performed. It is seen that the locally best invariant test is not always best but is reasonably good. It is slightly better than Cochran's test and suffers less from the limiting masking effect.
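The statistic named in this abstract is described as the ratio of the sum of squares of the data to the square of the sample mean. The sketch below follows that wording literally. The function name and the data are hypothetical, and no critical values are implied, so it shows only how the ratio responds to one unusually large observation.

```python
def bartholomew_statistic(sample):
    """Sum of squared observations divided by the squared sample mean."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    sample_mean = sum(sample) / n
    if sample_mean == 0:
        raise ValueError("sample mean must be non-zero")
    return sum(x * x for x in sample) / sample_mean ** 2


# Hypothetical positive data: identical values versus one large observation.
even = [2.0, 2.0, 2.0, 2.0]
with_outlier = [1.0, 1.0, 1.0, 9.0]
stat_even = bartholomew_statistic(even)
stat_outlier = bartholomew_statistic(with_outlier)
print(stat_even, stat_outlier)
```

For a sample of n identical values the ratio equals n, its smallest possible value, and it grows as the observations become more unequal in scale.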

19.

An empirical study of the type I error rate and power for some selected normal-theory and nonparametric tests of the independence of two sets of variables

Abdul R. Habib Michael R. Harwell 《Communications in Statistics - Simulation and Computation》2013,42(2):793-826

Normal-theory tests of the hypothesis of no relationship between two sets of variables require assumptions of independence, homoscedasticity, and normality. If, however, the assumption of normality is not tenable, there are few guidelines for properly using these tests. Historically, the lack of a comprehensive hypothesis-testing framework in the nonparametric case has provided few alternatives to normal-theory procedures. Fortunately, this situation has changed with the introduction of nonparametric, general linear model-based tests that can be used with existing computing packages. Multivariate-nonparametric tests due to Puri and Sen (1969, 1971, 1985) and Conover and Iman (1981) are outlined, and the results of a simulation study of the performance of three nonparametric and one normal-theory test of the hypothesis of no relationship between two sets of variables are presented. These results suggest that multivariate-nonparametric tests should be considered for a variety of data conditions, especially heavy-tailed and badly skewed data for small samples and a large number of variates.

20.

Properties of two tests for outliers in multivariate data

Martin A. Stapanian Forest C. Garner Kirk E. Fitzgerald George T. Flatman Evan J. Englund 《Communications in Statistics - Simulation and Computation》2013,42(2-3):667-687

Mardia's multivariate kurtosis and the generalized distance have desirable properties as multivariate outlier tests. However, extensive critical values have not been published heretofore. A published approximation formula for critical values of the kurtosis is shown to inadequately control the type I error rate, with observed error rates often differing from their intended values by a factor of two or more. Critical values derived from simulations for both tests for up to 25 dimensions and 500 observations are presented. The power curves of both tests are discussed. The generalized distance is the more powerful test when exactly one outlier is present and the contaminant is substantially mean-shifted. However, as the number of outliers increases, the kurtosis becomes the more powerful test. The two tests are compared with respect to power and vulnerability to masking. Recommendations for the use of these tests and interpretation of results are given.