Similar articles
1.
The k-means algorithm is one of the most common non-hierarchical methods of clustering. It aims to construct clusters that minimize the within-cluster sum of squared distances. However, as with most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. This article focuses on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. This consequence is emphasized through a comparison of the influence functions and breakdown points of these error rates.
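A small simulation makes the contamination effect on the error rate concrete. The sketch below (plain Lloyd-type k-means on a univariate two-component Gaussian mixture, with arbitrary illustrative parameters, not the paper's generalized k-means analysis) compares the empirical misclassification rate on a clean and on a contaminated sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans_1d(x, k=2, iters=50):
    # Plain Lloyd iterations in one dimension.
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def error_rate(labels, truth):
    # Cluster labels are arbitrary, so take the better of the two matchings.
    e = np.mean(labels != truth)
    return min(e, 1.0 - e)

truth = rng.integers(0, 2, size=500)
clean = np.where(truth == 0, rng.normal(-2, 1, 500), rng.normal(2, 1, 500))
dirty = clean.copy()
dirty[:25] = 50.0                     # 5% gross outliers

for name, x in [("clean", clean), ("contaminated", dirty)]:
    print(name, error_rate(kmeans_1d(x), truth))
```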

2.
In an attempt to apply robust procedures, conventional t-tables are used to approximate critical values of a Studentized t-statistic that is formed from the ratio of a trimmed mean to the square root of a suitably normed Winsorized sum of squared deviations. It is shown here that the approximation is poor if the proportion of trimming is chosen to depend on the data. Instead, a data-dependent alternative is given which uses adaptive trimming proportions and confidence intervals based on trimmed likelihood statistics. The resulting statistics have high efficiency at the normal model and proper coverage for confidence intervals, yet retain a breakdown point of one half. Average lengths of confidence intervals are competitive with those of recent Studentized confidence intervals based on the biweight over a range of underlying distributions. In addition, the adaptive trimming is used to identify potential outliers. Evidence in the form of simulations and data analysis supports the new adaptive trimming approach.
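For orientation, the statistic in question has the classical Tukey–McLaughlin form when the trimming proportion is fixed in advance; the sketch below implements that fixed-proportion version (the adaptive choice studied in the paper is precisely what this version lacks):

```python
import numpy as np

def trimmed_t(x, mu0=0.0, prop=0.1):
    # Tukey-McLaughlin: trimmed mean Studentized by the Winsorized
    # sum of squared deviations; compare to a t distribution on h-1 df.
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    g = int(np.floor(prop * n))              # observations cut per tail
    h = n - 2 * g                            # effective sample size
    trimmed = x[g:n - g]
    winsor = np.concatenate([np.full(g, x[g]), trimmed,
                             np.full(g, x[n - g - 1])])
    ssw = np.sum((winsor - winsor.mean()) ** 2)
    se = np.sqrt(ssw / (h * (h - 1)))
    return (trimmed.mean() - mu0) / se

rng = np.random.default_rng(1)
print(trimmed_t(rng.standard_t(df=3, size=50)))
```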

3.
As cancer treatments progress, a number of cancers are curable if diagnosed early. In population-based cancer survival studies, cure is said to occur when the mortality rate of the cancer patients returns to the same level as that expected for the general cancer-free population. Estimates of the cure fraction are of interest to both cancer patients and health policy makers. Mixture cure models have been widely used because they are easy to interpret, separating the patients into two distinct groups. Usually, parametric models are assumed for the latent survival distribution of the uncured patients, and the estimated cure fraction may be sensitive to misspecification of this latent distribution. We propose a Bayesian approach to the mixture cure model for population-based cancer survival data, which can be extended to county-level cancer survival data. Instead of modeling the latent distribution by a fixed parametric distribution, we use a finite mixture of the union of the lognormal, loglogistic, and Weibull distributions. The parameters are estimated using the Markov chain Monte Carlo method. A simulation study shows that the Bayesian method using a finite mixture latent distribution provides robust inference for the parameter estimates. The proposed Bayesian method is applied to relative survival data for colon cancer patients from the Surveillance, Epidemiology, and End Results (SEER) Program to estimate the cure fractions. The Canadian Journal of Statistics 40: 40–54; 2012 © 2012 Statistical Society of Canada
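To fix ideas, the basic mixture cure model puts S(t) = π + (1 − π)Su(t), where π is the cure fraction and Su is the latency survival function of the uncured. The sketch below fits this model by maximum likelihood with a single Weibull latency on simulated right-censored data; this is exactly the kind of fixed parametric latency the paper replaces with a Bayesian finite mixture, and every number in it is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, t, event):
    pi, shape, scale = params
    su = np.exp(-(t / scale) ** shape)                      # Weibull survival
    fu = (shape / scale) * (t / scale) ** (shape - 1) * su  # Weibull density
    s = pi + (1 - pi) * su                                  # cure-model survival
    f = (1 - pi) * fu                                       # cure-model density
    return -np.sum(event * np.log(f) + (1 - event) * np.log(s))

rng = np.random.default_rng(2)
n, pi_true = 400, 0.4
cured = rng.random(n) < pi_true
t_event = 2.0 * rng.weibull(1.5, n)        # latency times for the uncured
t_cens = rng.uniform(0.1, 8.0, n)          # administrative censoring
t = np.where(cured, t_cens, np.minimum(t_event, t_cens))
event = ((~cured) & (t_event <= t_cens)).astype(float)

res = minimize(neg_loglik, x0=[0.5, 1.0, 1.0], args=(t, event),
               bounds=[(0.01, 0.99), (0.1, 10.0), (0.1, 10.0)])
print("estimated cure fraction:", res.x[0])
```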

4.
Formal inference in randomized clinical trials is based on controlling the type I error rate associated with a single pre-specified statistic. The deficiency of using just one method of analysis is that it depends on assumptions that may not be met. For robust inference, we propose pre-specifying multiple test statistics and relying on the minimum p-value for testing the null hypothesis of no treatment effect. The null hypothesis associated with the various test statistics is that the treatment groups are indistinguishable. The critical value for hypothesis testing comes from permutation distributions. Rejecting the null hypothesis when the smallest p-value is less than the critical value controls the type I error rate at its designated value. Even if one of the candidate test statistics has low power, the adverse effect on the power of the minimum p-value statistic is modest. Its use is illustrated with examples. We conclude that it is better to rely on the minimum p-value rather than on a single statistic, particularly when that single statistic is the logrank test, given the cost and complexity of many survival trials. Copyright © 2013 John Wiley & Sons, Ltd.
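The procedure is straightforward to implement because the same permutations that calibrate each candidate statistic also calibrate their minimum p-value. The sketch below uses two hypothetical candidates (difference in means and difference in medians) on two-sample data; in a survival trial the candidates would instead be statistics such as the logrank test:

```python
import numpy as np

rng = np.random.default_rng(3)

def min_p_test(x, y, n_perm=2000):
    stats = [lambda a, b: abs(a.mean() - b.mean()),
             lambda a, b: abs(np.median(a) - np.median(b))]
    pooled, nx = np.concatenate([x, y]), len(x)
    obs = np.array([s(x, y) for s in stats])
    perm = np.empty((n_perm, len(stats)))
    for i in range(n_perm):
        z = rng.permutation(pooled)
        perm[i] = [s(z[:nx], z[nx:]) for s in stats]
    # Per-statistic permutation p-values, for the data and for each permutation.
    p_obs = (1 + (perm >= obs).sum(axis=0)) / (n_perm + 1)
    p_perm = (1 + (perm[:, None, :] >= perm[None, :, :]).sum(axis=0)) / (n_perm + 1)
    # Calibrate the minimum p-value against its own permutation distribution.
    minp_null = p_perm.min(axis=1)
    return (1 + np.sum(minp_null <= p_obs.min())) / (n_perm + 1)

x, y = rng.normal(0.0, 1.0, 30), rng.normal(0.8, 1.0, 30)
print("min-p permutation p-value:", min_p_test(x, y))
```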

5.
For probability distributions on ℝ^q, a detailed study is given of the breakdown properties of some multivariate M-functionals related to Tyler's [Ann. Statist. 15 (1987) 234] 'distribution-free' M-functional of scatter. These include a symmetrized version of Tyler's M-functional of scatter, and the multivariate t M-functionals of location and scatter. It is shown that for 'smooth' distributions, the (contamination) breakdown points of Tyler's M-functional of scatter and of its symmetrized version are 1/q and 1 − √(1 − 1/q), respectively. For the multivariate t M-functional, which arises from the maximum likelihood estimate for the parameters of an elliptical t distribution with ν ≥ 1 degrees of freedom, the breakdown point at smooth distributions is 1/(q + ν). Breakdown points are also obtained for general distributions, including empirical distributions. Finally, the sources of breakdown are investigated. It turns out that breakdown can only be caused by contaminating distributions that are concentrated near low-dimensional subspaces.
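As an illustration of the central object, the sketch below runs the standard fixed-point iteration defining Tyler's M-functional of scatter for data centered at a known location (trace-normalized; the sample and tolerances are arbitrary, and the breakdown analysis itself is not reproduced):

```python
import numpy as np

def tyler_scatter(X, iters=100, tol=1e-8):
    # Fixed point of V = (q/n) * sum_i x_i x_i' / (x_i' V^{-1} x_i).
    n, q = X.shape
    V = np.eye(q)
    for _ in range(iters):
        d = np.einsum('ij,jk,ik->i', X, np.linalg.inv(V), X)
        Vn = (q / n) * (X.T / d) @ X
        Vn *= q / np.trace(Vn)          # fix the arbitrary scale
        if np.max(np.abs(Vn - V)) < tol:
            return Vn
        V = Vn
    return V

rng = np.random.default_rng(4)
X = rng.standard_t(df=2, size=(200, 3))  # very heavy tails are no obstacle
print(tyler_scatter(X).round(3))
```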

6.
We study the problem of merging homogeneous groups of pre-classified observations from a robust perspective motivated by the anti-fraud analysis of international trade data. This problem may be seen as a clustering task which exploits preliminary information on the potential clusters, available in the form of group-wise linear regressions. Robustness is then needed because of the sensitivity of likelihood-based regression methods to deviations from the postulated model. Through simulations run under different contamination scenarios, we assess the impact of outliers both on group-wise regression fitting and on the quality of the final clusters. We also compare alternative robust methods that can be adopted to detect the outliers and thus to clean the data. One major conclusion of our study is that the use of robust procedures for preliminary outlier detection is generally recommended, except perhaps when contamination is weak and the identification of cluster labels is more important than the estimation of group-specific population parameters. We also apply the methodology to find homogeneous groups of transactions in one empirical example that illustrates our motivating anti-fraud framework.

7.
Robustness against design breakdown following observation loss is investigated for Partially Balanced Incomplete Block Designs with two associate classes (PBIBD(2)s). New results are obtained which add to the body of knowledge on PBIBD(2)s. In particular, using an approach based on the E‐value of a design, all PBIBD(2)s with triangular and Latin square association schemes are established as having optimal block breakdown number. Furthermore, for group divisible designs not covered by existing results in the literature, a sufficient condition for optimal block breakdown number establishes that all members of some design sub‐classes have this property.

8.
The trend test is often used for the analysis of 2×K ordered categorical data, in which K pre-specified increasing scores are used. There has been discussion of how to assign these scores and of the impact of different score choices on the outcome. The scores are often assigned based on the data-generating model. When this model is unknown, the trend test is not robust. We discuss the weighted average of a trend test over all scientifically plausible choices of scores or models. This approach is more computationally efficient than MAX, a commonly used robust test, when K is large. Our discussion applies to any ordered 2×K table, but the simulations and applications to real data focus on case-control genetic association studies. Although no single test is optimal for all choices of scores, our numerical results show that some score-averaging tests can achieve the performance of MAX.
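For reference, the trend test in question is the Cochran-Armitage statistic, which for a given score vector s has the closed form used below; the sketch evaluates it on a made-up 2×3 table for three plausible score choices and reports their plain, uniformly weighted average as a simple stand-in for the score-averaging idea:

```python
import numpy as np

def trend_z(cases, controls, scores):
    # Cochran-Armitage trend statistic for a 2xK table.
    r, m, s = (np.asarray(a, dtype=float) for a in (cases, controls, scores))
    n = r + m
    N = n.sum()
    p = r.sum() / N
    num = np.sum(s * (r - n * p))
    var = p * (1 - p) * (np.sum(s**2 * n) - np.sum(s * n)**2 / N)
    return num / np.sqrt(var)

cases, controls = [30, 40, 30], [50, 35, 15]
score_sets = {"additive": [0, 1, 2], "dominant": [0, 1, 1], "recessive": [0, 0, 1]}
zs = {k: trend_z(cases, controls, v) for k, v in score_sets.items()}
print(zs)
print("uniform average:", np.mean(list(zs.values())))
```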

9.

The purpose of this paper is to show, in regression clustering, how to choose the most relevant solutions, analyze their stability, and provide information about the best combinations of the optimal number of groups, the restriction factor on the error variances across groups, and the level of trimming. The procedure is based on two steps. First, we generalize the information criteria of constrained robust multivariate clustering to the case of clustering weighted models. Unlike traditional approaches, which are based on choosing the best solution found by minimizing an information criterion (e.g., BIC), we concentrate on the so-called optimal stable solutions. In the second step, using the monitoring approach, we select the best value of the trimming factor. Finally, we validate the solution using a confirmatory forward search approach. A motivating example based on a novel dataset concerning the European Union trade of face masks shows the limitations of the existing procedures. The suggested approach is first applied to a set of well-known datasets in the literature on robust regression clustering. We then focus on a set of international trade datasets and provide a novel, informative way of updating the subset in the random-start approach. The Supplementary material, in the spirit of the Special Issue, deepens the analysis of the trade data and compares the suggested approach with those available in the literature.


10.
We discuss moving window techniques for fast extraction of a signal composed of monotonic trends and abrupt shifts from a noisy time series with irrelevant spikes. Running medians remove spikes and preserve shifts, but they deteriorate in trend periods. Modified trimmed mean filters use a robust scale estimate such as the median absolute deviation about the median (MAD) to select an adaptive amount of trimming. Application of robust regression, particularly of the repeated median, has been suggested for improving upon the median in trend periods. We combine these ideas and construct modified filters based on the repeated median offering better shift preservation. All these filters are compared with respect to fundamental analytical properties and in basic data situations. An algorithm that updates the MAD in O(log n) time for window width n is also presented.
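A compact version of the modified trimmed mean idea: in each window, compute the median and the MAD, average only the observations within c·MAD of the median (c is an arbitrary illustrative choice here), and slide the window along the series. This O(n·width) sketch does not include the O(log n) MAD update mentioned above:

```python
import numpy as np

def mtm_filter(y, width=11, c=2.0):
    half = width // 2
    out = np.empty(len(y))
    for t in range(len(y)):
        w = y[max(0, t - half): t + half + 1]
        med = np.median(w)
        mad = np.median(np.abs(w - med))
        keep = w[np.abs(w - med) <= c * mad] if mad > 0 else w[w == med]
        out[t] = keep.mean()               # trimmed mean of the window
    return out

rng = np.random.default_rng(5)
t = np.arange(200.0)
signal = np.where(t < 100, 0.05 * t, 7.0)      # monotonic trend, then a shift
y = signal + rng.normal(0, 0.3, 200)
y[rng.choice(200, 10, replace=False)] += 15.0  # irrelevant spikes
print(np.round(mtm_filter(y)[95:105], 2))      # the shift is preserved
```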

11.
Two-colour microarray experiments form an important tool in gene expression analysis. Due to the high risk of missing observations in microarray experiments, it is fundamental to concentrate not only on optimal designs but also on designs that are robust against missing observations. As an extension of Latif et al. (2009), we define the optimal breakdown number for a collection of designs to describe this robustness, and we calculate the breakdown number for various D-optimal block designs. We show that, for certain values of the numbers of treatments and arrays, the D-optimal designs have the highest breakdown number. Our calculations use methods from graph theory.
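The graph-theoretic view can be made concrete: treatments are vertices, each array is an edge, and all treatment contrasts remain estimable exactly as long as the design graph stays connected. The sketch below (an illustration of this connectivity idea, not the paper's breakdown-number calculations) computes the minimum number of lost arrays that can disconnect two simple designs:

```python
import networkx as nx

# A cyclic design on 6 treatments versus an all-pairs design:
# edge connectivity = smallest number of lost arrays that disconnects it.
print(nx.edge_connectivity(nx.cycle_graph(6)))     # 2
print(nx.edge_connectivity(nx.complete_graph(6)))  # 5
```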

12.
This paper proposes robust regression to address the problem of outliers in seemingly unrelated regression (SUR) models. The authors present an adaptation of S‐estimators to SUR models. S‐estimators are robust, have a high breakdown point, and are much more efficient than other robust regression estimators commonly used in practice. Furthermore, modifications to Ruppert's algorithm allow fast evaluation in this context. The classical example of U.S. corporations is revisited, and it appears that the procedure gives interesting insight into the problem.

13.
This article introduces a parametric robust way of comparing two population means and two population variances. With large samples, the comparison of two means under model misspecification is less of a problem, since the validity of inference is protected by the central limit theorem. However, the assumption of normality is generally required for inference on the ratio of two variances to be carried out with the familiar F statistic. A parametric robust approach that is insensitive to the distributional assumption is proposed here. More specifically, it is demonstrated that the normal likelihood function can be adjusted to yield asymptotically valid inferences for all underlying distributions with finite fourth moments. The normal likelihood function, on the other hand, is itself robust for the comparison of two means, so no adjustment is needed there.
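One standard moment-based route to such distribution-insensitive inference (a sketch in the same spirit, not necessarily the paper's adjusted-likelihood construction) uses the asymptotic approximation Var(log s²) ≈ (κ − 1)/n with κ = m4/m2², which remains valid for any distribution with finite fourth moments:

```python
import numpy as np

def log_var_ratio_ci(x, y, z=1.96):
    # Kurtosis-adjusted asymptotic CI for the ratio of two variances.
    def log_var_and_avar(u):
        u = np.asarray(u, dtype=float)
        m2 = u.var()
        kappa = np.mean((u - u.mean()) ** 4) / m2**2
        return np.log(m2), (kappa - 1) / len(u)   # Var(log s^2) ~ (kappa-1)/n
    lx, vx = log_var_and_avar(x)
    ly, vy = log_var_and_avar(y)
    d, se = lx - ly, np.sqrt(vx + vy)
    return np.exp(d - z * se), np.exp(d + z * se)

rng = np.random.default_rng(6)
print(log_var_ratio_ci(rng.laplace(size=200), rng.laplace(size=200)))
```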

14.
The Nadaraya–Watson estimator is among the most studied nonparametric regression methods. A classical result is that its convergence rate depends on the number of covariates and deteriorates quickly as the dimension grows. This underscores the “curse of dimensionality” and has limited its use in high‐dimensional settings. In this paper, however, we show that the Nadaraya–Watson estimator has an oracle property such that when the true regression function is single‐ or multi‐index, it discovers the low‐rank dependence structure between the response and the covariates, mitigating the curse of dimensionality. Specifically, we prove that, using K‐fold cross‐validation and a positive‐semidefinite bandwidth matrix, the Nadaraya–Watson estimator has a convergence rate that depends on the number of indices rather than on the number of covariates. This result follows by allowing the bandwidths to diverge to infinity rather than restricting them all to converge to zero at certain rates, as in previous theoretical studies.
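A minimal version of the estimator and the bandwidth selection is easy to write down; the sketch below uses a Gaussian kernel with a single scalar bandwidth chosen by K-fold cross-validation on a simulated single-index model (a simplification of the positive-semidefinite bandwidth matrix in the result above):

```python
import numpy as np

def nw_predict(X, y, X0, h):
    # Nadaraya-Watson: locally weighted average with a Gaussian kernel.
    d2 = ((X0[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-0.5 * d2 / h**2)
    return (w @ y) / w.sum(axis=1)

def cv_bandwidth(X, y, grid, k=5):
    folds = np.arange(len(y)) % k
    def score(h):
        return sum(((y[folds == f] -
                     nw_predict(X[folds != f], y[folds != f],
                                X[folds == f], h)) ** 2).sum()
                   for f in range(k))
    return min(grid, key=score)

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(300, 5))
y = np.sin(X @ np.ones(5)) + rng.normal(0, 0.2, 300)  # single-index truth
print("CV bandwidth:", cv_bandwidth(X, y, np.linspace(0.5, 3.0, 11)))
```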

15.
The rate of population growth (u) is an important demographic parameter used to assess the viability of a population and to develop management and conservation agendas. We examined the use of resighting data to estimate u for the snail kite population in Florida from 1997 to 2000. The analyses consisted of (1) a robust design approach that derives an estimate of u from estimates of population size and (2) the Pradel (1996) temporal symmetry (TSM) approach, which directly estimates u using an open-population capture-recapture model. Besides resighting data, both approaches required information on the number of unmarked individuals sighted during the sampling periods. The point estimates of u differed between the robust design and TSM approaches, but the 95% confidence intervals overlapped substantially. We believe the differences may be the result of sparse data and do not indicate the inappropriateness of either modelling technique. We focused on the results of the robust design because this approach provided estimates for all study years. Variation among these estimates was smaller than the variation among ad hoc estimates based on previously reported index statistics. We recommend that u for snail kites be estimated using capture-resighting methods rather than ad hoc counts.

16.
In comparing a collection of K populations, it is common practice to display in one visualization confidence intervals for the corresponding population parameters θ1, θ2, …, θK. For a pair of confidence intervals that do (or do not) overlap, viewers of the visualization are cognitively compelled to declare that there is not (or there is) a statistically significant difference between the two corresponding population parameters. It is generally well known that examining the overlap of pairs of confidence intervals should not be used for formal hypothesis testing, yet a single visualization with overlapping and nonoverlapping confidence intervals leads many to draw such conclusions, despite statisticians' best efforts to prevent them. In this article, we summarize some alternative visualizations from the literature that can be used to properly test equality between a pair of population parameters. We recommend that these visualizations be used with caution to avoid incorrect statistical inference. The methods presented require only the K sample estimates and their associated standard errors. We also assume that the sample estimators are independent, unbiased, and normally distributed.
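One simple adjusted visualization from this literature (in the spirit of Goldstein and Healy, 1995) rescales the bars so that overlap does match a pairwise test: with independent estimates and roughly equal standard errors, bars of half-width z·SE/√2 overlap exactly when the two-sample z-test is not significant at the corresponding level. A sketch with made-up estimates:

```python
import numpy as np

est = np.array([10.0, 11.2, 13.5])
se = np.array([0.5, 0.5, 0.5])
z = 1.96                                   # two-sided 5% level
lo = est - z * se / np.sqrt(2)             # comparison intervals, not CIs
hi = est + z * se / np.sqrt(2)
for i in range(len(est)):
    for j in range(i + 1, len(est)):
        overlap = (lo[j] <= hi[i]) and (lo[i] <= hi[j])
        signif = abs(est[i] - est[j]) / np.sqrt(se[i]**2 + se[j]**2) > z
        print(f"pair ({i},{j}): bars overlap={overlap}, z-test significant={signif}")
```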

17.
Robust statistics considers the quality of statistical decisions in the presence of deviations from the ideal model, where deviations are modelled by neighborhoods of a certain size, or radius, about the ideal model. We introduce a new concept of optimality (radius-minimaxity) for the case that this radius is not precisely known: for this notion, we determine the increase of the maximum risk over the minimax risk when the optimally robust estimator for a false neighborhood radius is used. The maximum increase of the relative risk is minimized in the case that the radius is known only to belong to some interval [r_l, r_u]. We pursue this minimax approach for a number of ideal models and a variety of neighborhoods. The effect of increasing parameter dimension is also studied for these models. The minimax increase of relative risk when the radius is completely unknown, compared with that of the most robust procedure, is 18.1% versus 57.1% and 50.5% versus 172.1% for one-dimensional location and scale, respectively, and less than 1/3 in other typical contamination models. In most models considered so far, the radius needs to be specified only up to a certain factor in order to keep the increase of relative risk below 12.5%, provided that the radius-minimax robust estimator is employed. The least favorable radii leading to the radius-minimax estimators turn out to be small: 5–6% contamination at sample size 100.

18.
The maxbias function BT(ε) contains much information about the robustness properties of the estimate T. This function satisfies BT(0) = 0 and BT(ε) < ∞ for all 0 < ε < ε*, where ε* is the breakdown point of T. Hampel (1974) pioneered the study of the limiting behaviour of BT(ε) as ε → 0. He computed and optimized the rate γ at which BT(ε) approaches 0 when ε → 0. This rate is now called the contamination sensitivity of T, and constitutes one of the cornerstones of the theory of robustness. We show that much can also be learned from the study of the limiting behaviour of BT(ε) when ε → ε*. A new robustness measure, called the relative explosion rate, can be obtained by studying the limiting relative maxbias behaviour of two estimates as ε approaches their common breakdown point ε*. Like the contamination sensitivity, the relative explosion rate can be readily derived from the estimate's score function. General formulae are given for M-estimates of scale and for S-, MM- and τ-estimates of regression. We also show that the maxbias behaviour for large ε is largely determined by the curvature of the estimate's score function near zero. This motivates our definition and study of the local order of a score function.

19.
Ordinary least-squares (OLS) estimators for a linear model are very sensitive to unusual values in the design space and to outliers among the y values. Even a single atypical value may have a large effect on the parameter estimates. This article reviews and describes some available and popular robust techniques, including some recently developed ones, and compares them in terms of breakdown point and efficiency. In addition, we use a simulation study and a real-data application to compare the performance of the existing robust methods under different scenarios.
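As a concrete instance of the OLS fragility described above, the sketch below contrasts OLS with a Huber M-estimate fitted by iteratively reweighted least squares on data containing one gross outlier in y; this is one classical robust method such a review covers, not a reproduction of the article's own comparisons:

```python
import numpy as np

def huber_irls(X, y, c=1.345, iters=50):
    # Huber M-estimate via IRLS, with residual scale from the MAD.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS start
    for _ in range(iters):
        r = y - X @ beta
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        w = np.minimum(1.0, c * s / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)                                # weighted LS step
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)
y[0] += 100.0                                          # one atypical y value
X = np.column_stack([np.ones(50), x])
print("OLS:  ", np.linalg.lstsq(X, y, rcond=None)[0])
print("Huber:", huber_irls(X, y))                      # close to (1, 2)
```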
