首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
The presence of extreme outliers in the upper tail data of income distribution affects the Pareto tail modeling. A simulation study is carried out to compare the performance of three types of boxplot in the detection of extreme outliers for Pareto data, including standard boxplot, adjusted boxplot and generalized boxplot. It is found that the generalized boxplot is the best method for determining extreme outliers for Pareto distributed data. For the application, the generalized boxplot is utilized for determining the exreme outliers in the upper tail of Malaysian income distribution. In addition, for this data set, the confidence interval method is applied for examining the presence of dragon-kings, extreme outliers which are beyond the Pareto or power-laws distribution.  相似文献   

2.
It is important to identify outliers since inclusion, especially when using parametric methods, can cause distortion in the analysis and lead to erroneous conclusions. One of the easiest and most useful methods is based on the boxplot. This method is particularly appealing since it does not use any outliers in computing spread. Two methods, one by Carling and another by Schwertman and de Silva, adjust the boxplot method for sample size and skewness. In this paper, the two procedures are compared both theoretically and by Monte Carlo simulations. Simulations using both a symmetric distribution and an asymmetric distribution were performed on data sets with none, one, and several outliers. Based on the simulations, the Carling approach is superior in avoiding masking outliers, that is, the Carling method is less likely to overlook an outlier while the Schwertman and de Silva procedure is much better at reducing swamping, that is, misclassifying an observation as an outlier. Carling’s method is to the Schwertman and de Silva procedure as comparisonwise versus experimentwise error rate is for multiple comparisons. The two methods, rather than being competitors, appear to complement each other. Used in tandem they provide the data analyst a more complete prospective for identifying possible outliers.  相似文献   

3.
The boxplot is an effective data-visualization tool useful in diverse applications and disciplines. Although more sophisticated graphical methods exist, the boxplot remains relevant due to its simplicity, interpretability, and usefulness, even in the age of big data. This article highlights the origins and developments of the boxplot that is now widely viewed as an industry standard as well as its inherent limitations when dealing with data from skewed distributions, particularly when detecting outliers. The proposed Ratio-Skewed boxplot is shown to be practical and suitable for outlier labeling across several parametric distributions.  相似文献   

4.
Outlier detection has been used extensively in data analysis to detect anomalous observation in data. It has important applications such as in fraud detection and robust analysis, among others. In this paper, we propose a method in detecting multiple outliers in linear functional relationship model for circular variables. Using the residual values of the Caires and Wyatt model, we applied the hierarchical clustering approach. With the use of a tree diagram, we illustrate the detection of outliers graphically. A Monte Carlo simulation study is done to verify the accuracy of the proposed method. Low probability of masking and swamping effects indicate the validity of the proposed approach. Also, the illustrations to two sets of real data are given to show its practical applicability.  相似文献   

5.
函数型数据本质上是一种复杂数据,其抽样、生成、结构和关联程度都会影响到数据的复杂性和描述性,有些情形甚至连基本的可视化描述都成为难点。在利用函数型数据的主成分得分、图基的数据深度和密度概念的基础上,引入函数型数据的打包图和箱线图,并针对函数型数据的图形分析提出了函数型数据异常值检测的三种方法。与已有的检测方法相比较,所提三种方法更易于识别函数型数据的异常值。  相似文献   

6.
A boxplot is a simple and effective exploratory data analysis tool for graphically summarizing a distribution of data. However, in cases where the quartiles in a boxplot are inaccurately estimated, these estimates can affect subsequent analyses. In this paper, we consider the problem of constructing boxplots in a bivariate setting with a categorical covariate with multiple subgroups, and assume that some of these boxplots can be clustered. We propose to use this grouping property to improve the estimation of the quartiles. We demonstrate that the proposed method more accurately estimates the quartiles compared to the usual boxplot. It is also shown that the proposed method identifies outliers effectively as a consequence of accurate quartiles, and possesses a clustering effect due to the group property. We then apply the proposed method to annual maximum precipitation data in South Korea and present its clustering results.  相似文献   

7.
We propose a robust estimation procedure for the analysis of longitudinal data including a hidden process to account for unobserved heterogeneity between subjects in a dynamic fashion. We show how to perform estimation by an expectation–maximization-type algorithm in the hidden Markov regression literature. We show that the proposed robust approaches work comparably to the maximum-likelihood estimator when there are no outliers and the error is normal and outperform it when there are outliers or the error is heavy tailed. A real data application is used to illustrate our proposal. We also provide details on a simple criterion to choose the number of hidden states.  相似文献   

8.
Support Vector Regression (SVR) is gaining in popularity in the detection of outliers and classification problems in high-dimensional data (HDD) as this technique does not require the data to be of full rank. In real application, most of the data are of high dimensional. Classification of high-dimensional data is needed in applied sciences, in particular, as it is important to discriminate cancerous cells from non-cancerous cells. It is also imperative that outliers are identified before constructing a model on the relationship between the dependent and independent variables to avoid misleading interpretations about the fitting of a model. The standard SVR and the μ-ε-SVR are able to detect outliers; however, they are computationally expensive. The fixed parameters support vector regression (FP-ε-SVR) was put forward to remedy this issue. However, the FP-ε-SVR using ε-SVR is not very successful in identifying outliers. In this article, we propose an alternative method to detect outliers i.e. by employing nu-SVR. The merit of our proposed method is confirmed by three real examples and the Monte Carlo simulation. The results show that our proposed nu-SVR method is very successful in identifying outliers under a variety of situations, and with less computational running time.  相似文献   

9.
Functional time series whose sample elements are recorded sequentially over time are frequently encountered with increasing technology. Recent studies have shown that analyzing and forecasting of functional time series can be performed easily using functional principal component analysis and existing univariate/multivariate time series models. However, the forecasting performance of such functional time series models may be affected by the presence of outlying observations which are very common in many scientific fields. Outliers may distort the functional time series model structure, and thus, the underlying model may produce high forecast errors. We introduce a robust forecasting technique based on weighted likelihood methodology to obtain point and interval forecasts in functional time series in the presence of outliers. The finite sample performance of the proposed method is illustrated by Monte Carlo simulations and four real-data examples. Numerical results reveal that the proposed method exhibits superior performance compared with the existing method(s).  相似文献   

10.
We model the Alzheimer's disease-related phenotype response variables observed on irregular time points in longitudinal Genome-Wide Association Studies as sparse functional data and propose nonparametric test procedures to detect functional genotype effects while controlling the confounding effects of environmental covariates. Our new functional analysis of covariance tests are based on a seemingly unrelated kernel smoother, which takes into account the within-subject temporal correlations, and thus enjoy improved power over existing functional tests. We show that the proposed test combined with a uniformly consistent nonparametric covariance function estimator enjoys the Wilks phenomenon and is minimax most powerful. Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative database, where an application of the proposed test lead to the discovery of new genes that may be related to Alzheimer's disease.  相似文献   

11.
Detection of outliers or influential observations is an important work in statistical modeling, especially for the correlated time series data. In this paper we propose a new procedure to detect patch of influential observations in the generalized autoregressive conditional heteroskedasticity (GARCH) model. Firstly we compare the performance of innovative perturbation scheme, additive perturbation scheme and data perturbation scheme in local influence analysis. We find that the innovative perturbation scheme give better result than other two schemes although this perturbation scheme may suffer from masking effects. Then we use the stepwise local influence method under innovative perturbation scheme to detect patch of influential observations and uncover the masking effects. The simulated studies show that the new technique can successfully detect a patch of influential observations or outliers under innovative perturbation scheme. The analysis based on simulation studies and two real data sets show that the stepwise local influence method under innovative perturbation scheme is efficient for detecting multiple influential observations and dealing with masking effects in the GARCH model.  相似文献   

12.
Whether an extreme observation is an outlier or not depends strongly on the corresponding tail behavior of the underlying distribution. We develop an automatic, data-driven method rooted in the mathematical theory of extremes to identify observations that deviate from the intermediate and central characteristics. The proposed algorithm is an extension of a method previously proposed in the literature for the specific case of heavy tailed Pareto-type distributions to all max-domains of attraction. We propose some applications such as a tail-adjusted boxplot which yields a more accurate representation of possible outliers, and the identification of outliers in a multivariate context through an analysis of associated random variables such as local outlier factors. Several examples and simulation results illustrate the finite sample behavior of the algorithm and its applications.  相似文献   

13.
Functional data are being observed frequently in many scientific fields, and therefore most of the standard statistical methods are being adapted for functional data. The multivariate analysis of variance problem for functional data is considered. It seems to be of practical interest similarly as the one-way analysis of variance for such data. For the MANOVA problem for multivariate functional data, we propose permutation tests based on a basis function representation and tests based on random projections. Their performance is examined in comprehensive simulation studies, which provide an idea of the size control and power of the tests and identify differences between them. The simulation experiments are based on artificial data and real labeled multivariate time series data found in the literature. The results suggest that the studied testing procedures can detect small differences between vectors of curves even with small sample sizes. Illustrative real data examples of the use of the proposed testing procedures in practice are also presented.  相似文献   

14.
函数性数据的统计分析:思想、方法和应用   总被引:9,自引:0,他引:9       下载免费PDF全文
严明义 《统计研究》2007,24(2):87-94
 摘  要:实际中,越来越多的研究领域所收集到的样本观测数据具有函数性特征,这种函数性数据是融合时间序列和横截面两者的数据,有些甚是曲线或其他函数图像。虽然计量经济学近二十多年来发展的面板数据分析方法,具有很好的应用价值,但是面板数据只是函数性数据的一种特殊类型,且其分析方法太过于依赖模型的线性结构和假设条件等。本文基于函数性数据的普遍特征,介绍一种对其进行分析的全新方法,并率先使用该方法对经济函数性数据进行分析,拓展了函数性数据分析的应用范围。分析结果表明,函数性数据分析方法,较之计量经济学和其他统计方法具有更多的优越性,尤其能够揭示其他方法所不能揭示的数据特征  相似文献   

15.
In this article, we discuss the estimation of the parameter function for a functional logistic regression model in the presence of outliers. We consider ways that allow for the parameter estimator to be resistant to outliers, in addition to minimizing multicollinearity and reducing the high dimensionality, which is inherent with functional data. To achieve this, the functional covariates and functional parameter of the model are approximated in a finite-dimensional space generated by an appropriate basis. This approach reduces the functional model to a standard multiple logistic model with highly collinear covariates and potential high-dimensionality issues. The proposed estimator tackles these issues and also minimizes the effect of functional outliers. Results from a simulation study and a real world example are also presented to illustrate the performance of the proposed estimator.  相似文献   

16.
This article discusses the estimation of the parameter function for a functional linear regression model under heavy-tailed errors' distributions and in the presence of outliers. Standard approaches of reducing the high dimensionality, which is inherent in functional data, are considered. After reducing the functional model to a standard multiple linear regression model, a weighted rank-based procedure is carried out to estimate the regression parameters. A Monte Carlo simulation and a real-world example are used to show the performance of the proposed estimator and a comparison made with the least-squares and least absolute deviation estimators.  相似文献   

17.
Summary.  As a part of the EUREDIT project new methods to detect multivariate outliers in incomplete survey data have been developed. These methods are the first to work with sampling weights and to be able to cope with missing values. Two of these methods are presented here. The epidemic algorithm simulates the propagation of a disease through a population and uses extreme infection times to find outlying observations. Transformed rank correlations are robust estimates of the centre and the scatter of the data. They use a geometric transformation that is based on the rank correlation matrix. The estimates are used to define a Mahalanobis distance that reveals outliers. The two methods are applied to a small data set and to one of the evaluation data sets of the EUREDIT project.  相似文献   

18.
A simple univariate outlier identification procedure is presented for the detection of multiple outliers in large and moderate sized data sets. This procedure is a modification of the well-known boxplot outlier-labeling rule. Critical values are easy to obtain for the large sample case for a variety of useful distributions, including the normal, t, gamma, and Weibull. Simple adjustment formulas and graphs are provided for handling smaller samples. Basic probability properties are obtained mathematically and through simulation. Two data sets illustrate the procedure's application as a simple and effective screening tool for both moderate and large-sized univariate samples.  相似文献   

19.
The use of logistic regression modeling has seen a great deal of attention in the literature in recent years. This includes all aspects of the logistic regression model including the identification of outliers. A variety of methods for the identification of outliers, such as the standardized Pearson residuals, are now available in the literature. These methods, however, are successful only if the data contain a single outlier. In the presence of multiple outliers in the data, which is often the case in practice, these methods fail to detect the outliers. This is due to the well-known problems of masking (false negative) and swamping (false positive) effects. In this article, we propose a new method for the identification of multiple outliers in logistic regression. We develop a generalized version of standardized Pearson residuals based on group deletion and then propose a technique for identifying multiple outliers. The performance of the proposed method is then investigated through several examples.  相似文献   

20.
Estimating location is a central problem in functional data analysis, yet most current estimation procedures either unrealistically assume completely observed trajectories or lack robustness with respect to the many kinds of anomalies one can encounter in the functional setting. To remedy these deficiencies we introduce the first class of optimal robust location estimators based on discretely sampled functional data. The proposed method is based on M-type smoothing spline estimation with repeated measurements and is suitable for both commonly and independently observed trajectories that are subject to measurement error. We show that under suitable assumptions the proposed family of estimators is minimax rate optimal both for commonly and independently observed trajectories and we illustrate its highly competitive performance and practical usefulness in a Monte-Carlo study and a real-data example involving recent Covid-19 data.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号