期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A comparative study of the K-means algorithm and the normal mixture model for clustering: Bivariate homoscedastic case

Dingxi Qiu 《Journal of statistical planning and inference》2010

The K-means algorithm and the normal mixture model method are two common clustering methods. The K-means algorithm is a popular heuristic approach which gives reasonable clustering results if the component clusters are ball-shaped. Currently, there are no analytical results for this algorithm if the component distributions deviate from the ball-shape. This paper analytically studies how the K-means algorithm changes its classification rule as the normal component distributions become more elongated under the homoscedastic assumption and compares this rule with that of the Bayes rule from the mixture model method. We show that the classification rules of both methods are linear, but the slopes of the two classification lines change in the opposite direction as the component distributions become more elongated. The classification performance of the K-means algorithm is then compared to that of the mixture model method via simulation. The comparison, which is limited to two clusters, shows that the K-means algorithm provides poor classification performances consistently as the component distributions become more elongated while the mixture model method can potentially, but not necessarily, take advantage of this change and provide a much better classification performance. 相似文献

2.

Snipping for robust k-means clustering under component-wise contamination

Alessio Farcomeni 《Statistics and Computing》2014,24(6):907-919

We introduce the concept of snipping, complementing that of trimming, in robust cluster analysis. An observation is snipped when some of its dimensions are discarded, but the remaining are used for clustering and estimation. Snipped k-means is performed through a probabilistic optimization algorithm which is guaranteed to converge to the global optimum. We show global robustness properties of our snipped k-means procedure. Simulations and a real data application to optical recognition of handwritten digits are used to illustrate and compare the approach. 相似文献

3.

A Permutation Based Procedure for Classification Assessment

《统计学通讯:理论与方法》2012,41(16-17):3126-3137

This article proposes a permutation procedure for evaluating the performance of different classification methods. In particular, we focus on two of the most widespread and used classification methodologies: latent class analysis and k-means clustering. The classification performance is assessed by means of a permutation procedure which allows for a direct comparison of the methods, the development of a statistical test, and points out better potential solutions. Our proposal provides an innovative framework for the validation of the data partitioning and offers a guide in the choice of which classification procedure should be used 相似文献

4.

Impact of Contamination on Training and Test Error Rates in Statistical Clustering

C. Ruwet G. Haesbroeck 《统计学通讯:模拟与计算》2013,42(3):394-411

The k-means algorithm is one of the most common non hierarchical methods of clustering. It aims to construct clusters in order to minimize the within cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates. 相似文献

5.

k-POD: A Method for k-Means Clustering of Missing Data 总被引：1，自引：0，他引：1

Jocelyn T. Chi Eric C. Chi Richard G. Baraniuk 《The American statistician》2013,67(1):91-99

The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.] 相似文献

6.

An optimal response adaptive biased coin design with k heteroscedastic treatments

Thomas E. GwiseJianhui Zhou Feifang Hu 《Journal of statistical planning and inference》2011,141(1):235-242

For comparing treatments in clinical trials, Atkinson (1982) introduced optimal biased coins for balancing patients across treatment assignments by using D-optimality under the assumption of homoscedastic responses of different treatments. However, this assumption can be violated in many real applications. In this paper, we relax the homoscedasticity assumption in the k treatments setting with k>2. A general family of optimal response adaptive biased coin designs are proposed following Atkinson's procedure. Asymptotic properties of the proposed designs are obtained. Some advantages of the proposed design are discussed. 相似文献

7.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《统计学通讯:模拟与计算》2019,48(6):1595-1605

One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data. 相似文献

8.

Asymptotics for trimmed k-means and associated tolerance zones

《Journal of statistical planning and inference》1999,77(2):247-262

Impartial trimming procedures with respect to general ‘penalty’ functions, Φ, have been recently introduced in Cuesta-Albertos et al. (1997. Ann. Statist. 25, 553–576) in the (generalized) k-means framework. Under regularity assumptions, for real-valued samples, we obtain the asymptotic normality both of the impartial trimmed k-mean estimator (Φ(x)=x²) and of the impartial trimmed k-median estimator (Φ(x)=x).In spite of the additional complexity coming from the several groups setting, the empirical quantile methodology used in Butler (1982. Ann. Statist. 10, 197–204) for the LTS estimator (and subsequently in Tableman (1994. Statist. Probab. Lett. 19, 387–398) for the LTAD estimator) also works in our framework. Besides their relevance for the robust estimation of quantizers, our results open the possibility of considering asymptotic distribution-free tolerance regions, constituted by unions of intervals, for predicting a future observation, generalizing the idea in Butler (1982). 相似文献

9.

k-Means Algorithm in Statistical Shape Analysis

Getulio J. A. Amaral Luiz H. Dore Rosangela P. Lessa Borko Stosic 《统计学通讯:模拟与计算》2013,42(5):1016-1026

In this work it is shown how the k-means method for clustering objects can be applied in the context of statistical shape analysis. Because the choice of the suitable distance measure is a key issue for shape analysis, the Hartigan and Wong k-means algorithm is adapted for this situation. Simulations on controlled artificial data sets demonstrate that distances on the pre-shape spaces are more appropriate than the Euclidean distance on the tangent space. Finally, results are presented of an application to a real problem of oceanography, which in fact motivated the current work. 相似文献

10.

A constrainedk-means clustering algorithm for classifying spatial units

G. Damiana Costanzo 《Statistical Methods and Applications》2001,10(1-3):237-256

In some classification problems it may be important to impose constraints on the set of allowable solutions. In particular, in regional taxonomy, urban and regional studies often try to segment a set of territorial data in homogenous groups with respect to a set of socio-economic variables taking into account, at the same time, contiguous neighbourhoods. The objects in a class are thus required not only to be similar to one another but also to be part of a spatially contiguous set. The rationale behind this is that if a spatially varying phenomenon influences the objects, as could occur in the case of geographical units, and this spatial information were ignored in constructing the classes then it would be less likely to be detected. In this paper a constrained version of thek-means clustering method (MacQueen, 1967; Ball and Hall, 1967) and a new algorithm for devising such a procedure are proposed; the latter is based on the efficient algorithm proposed by Hartigan and Wong (1979). This algorithm has proved its usefulness in zoning two large regions in Italy (Calabria and Puglia). 相似文献

11.

Circular consecutive k-out-of-n systems with exchangeable dependent components

Serkan Ery?lmaz 《Journal of statistical planning and inference》2011,141(2):725-733

In this paper we study circular consecutive k-out-of-n systems consisting of exchangeable components. We derive explicit expressions for both unconditional and conditional survival functions for 2k+1≥n, while signature based mixture representations for general k are obtained. The applications and computational results concerned with mean residual life function and stochastic ordering are presented. 相似文献

12.

Representative points for location-biased datasets

Zong-Feng Qi Kai-Tai Fang 《统计学通讯:模拟与计算》2019,48(2):458-471

Representative points (RPs) are a set of points that optimally represents a distribution in terms of mean square error. When the prior data is location biased, the direct methods such as the k-means algorithm may be inefficient to obtain the RPs. In this article, a new indirect algorithm is proposed to search the RPs based on location-biased datasets. Such an algorithm does not constrain the parameter model of the true distribution. The empirical study shows that such algorithm can obtain better RPs than the k-means algorithm. 相似文献

13.

Nonparametric model checking for k-out-of-n systems

Eric Beutner 《Journal of statistical planning and inference》2010

It is an important problem in reliability analysis to decide whether for a given k-out-of-n system the static or the sequential k-out-of-n model is appropriate. Often components are redundantly added to a system to protect against failure of the system. If the failure of any component of the system induces a higher rate of failure of the remaining components due to increased load, the sequential k-out-of-n model is appropriate. The increase of the failure rate of the remaining components after a failure of some component implies that the effects of the component redundancy are diminished. On the other hand, if all the components have the same failure distribution and whenever a failure occurs, the remaining components are not affected, the static k-out-of-n model is adequate. In this paper, we consider nonparametric hypothesis tests to make a decision between these two models. We analyze test statistics based on the profile score process as well as test statistics based on a multivariate intensity ratio and derive their asymptotic distribution. Finally, we compare the different test statistics. 相似文献

14.

Discrete q-distributions on Bernoulli trials with a geometrically varying success probability

Ch.A. Charalambides 《Journal of statistical planning and inference》2010

Consider a sequence of independent Bernoulli trials and assume that the odds of success (or failure) or the probability of success (or failure) at the ith trial varies (increases or decreases) geometrically with rate (proportion) q, for increasing i=1,2,…. Introducing the notion of a geometric sequence of trials as a sequence of Bernoulli trials, with constant probability, that is terminated with the occurrence of the first success, a useful stochastic model is constructed. Specifically, consider a sequence of independent geometric sequences of trials and assume that the probability of success at the jth geometric sequence varies (increases or decreases) geometrically with rate (proportion) q, for increasing j=1,2,…. On both models, let X_n be the number of successes up the nth trial and T_k (or W_k) be the number of trials (or failures) until the occurrence of the kth success. The distributions of these random variables turned out to be q-analogues of the binomial and Pascal (or negative binomial) distributions. The distributions of X_n, for n→∞

n \to \infty

, and the distributions of W_k, for k→∞

k \to \infty

, can be approximated by a q -Poisson distribution. Also, as k→0

k \to 0

, a zero truncated negative q -binomial distribution _U_k=_W_k|_W_k>0

U_{k} = W_{k} | W_{k} > 0

can be approximated by a q-logarithmic distribution. These discrete q-distributions and their applications are reviewed, with critical comments and additions. Finally, consider a sequence of independent Bernoulli trials and assume that the probability of success (or failure) is a product of two sequences of probabilities with one of these sequences depending only the number of trials and the other depending only on the number of successes (or failures). The q-distributions of the number X_n of successes up to the nth trial and the number T_k of trials until the occurrence of the kth success are similarly reviewed. 相似文献

15.

A multiple directional decision procedure for successive comparisons of treatment effects

《Journal of statistical planning and inference》2003,116(1):49-59

Suppose that the k treatments under comparison are ordered in a certain way. For example, there may be a sequence of increasing dose levels of a drug. It is interesting to look directly at the successive differences between the treatment effects μ_i's, namely the set of differences μ₂−μ₁,μ₃−μ₂,…,μ_k−μ_k−1. In particular, directional inferences on whether μ_i<μ_i+1 or μ_i>μ_i+1 for i=1,…,k−1 are useful. Lee and Spurrier (J. Statist. Plann. Inference 43 (1995) 323) present a one- and a two-sided confidence interval procedures for making successive comparisons between treatments. In this paper, we develop a new procedure which is sharper than both the one- and two-sided procedures of Lee and Spurrier in terms of directional inferences. This new procedure is able to make more directional inferences than the two-sided procedure and maintains the inferential sensitivity of the one-sided procedure. Note however this new procedure controls only type III error, but not type I error. The critical point of the new procedure is the same as that of Lee and Spurrier's one-sided procedure. We also propose a power function for the new procedure and determine the sample size necessary for a guaranteed power level. The application of the procedure is illustrated with an example. 相似文献

16.

Lower bounds on the expectations of upper record values

Agnieszka Goroncy Tomasz Rychlik 《Journal of statistical planning and inference》2011,141(8):2726-2737

For infinite sequences of independent random variables with identical continuous distributions, we establish optimal lower bounds on the deviations of the expectations of record values from population means in units generated by the central absolute moments of various orders. The bounds are non-negative for the classic record values, and non-positive for the other kth records with k?2. We also provide analogous bounds for the record increments. 相似文献

17.

A new characterization of Elfving's method for high dimensional computation

Jay Bartroff 《Journal of statistical planning and inference》2012,142(4):863-871

We give a new characterization of Elfving's (1952) method for computing c-optimal designs in k dimensions which gives explicit formulae for the k unknown optimal weights and k unknown signs in Elfving's characterization. This eliminates the need to search over these parameters to compute c-optimal designs, and thus reduces the computational burden from solving a family of optimization problems to solving a single optimization problem for the optimal finite support set. We give two illustrative examples: a high dimensional polynomial regression model and a logistic regression model, the latter showing that the method can be used for locally optimal designs in nonlinear models as well. 相似文献

18.

Explicit formulas for generalized family-wise error rates and unimprovable step-down multiple testing procedures

Alexander Y. Gordon 《Journal of statistical planning and inference》2007

Many multiple testing procedures (MTPs) are available today, and their number is growing. Also available are many type I error rates: the family-wise error rate (FWER), the false discovery rate, the proportion of false positives, and others. Most MTPs are designed to control a specific type I error rate, and it is hard to compare different procedures. We approach the problem by studying the exact level at which threshold step-down (TSD) procedures (an important class of MTPs exemplified by the classic Holm procedure) control the generalized FWER defined as the probability of k

k

or more false rejections. We find that level explicitly for any TSD procedure and any k

k

. No assumptions are made about the dependency structure of the p

p

-values of the individual tests. We derive from our formula a criterion for unimprovability of a procedure in the class of TSD procedures controlling the generalized FWER at a given level. In turn, this criterion implies that for each k

k

the number of such unimprovable procedures is finite and is greater than one if k>1

k > 1

. Consequently, in this case the most rejective procedure in the above class does not exist. 相似文献

19.

Weighting variables in K-means clustering

Myung-Hoe Huh 《Journal of applied statistics》2009,36(1):67-78

The aim of this study is to assign weights w ₁, …, w _m to m clustering variables Z ₁, …, Z _m, so that k groups were uncovered to reveal more meaningful within-group coherence. We propose a new criterion to be minimized, which is the sum of the weighted within-cluster sums of squares and the penalty for the heterogeneity in variable weights w ₁, …, w _m. We will present the computing algorithm for such k-means clustering, a working procedure to determine a suitable value of penalty constant and numerical examples, among which one is simulated and the other two are real. 相似文献

20.

A permutation test approach to the choice of size k for the nearest neighbors classifier

Yinglei Lai Baolin Wu Hongyu Zhao 《Journal of applied statistics》2011,38(10):2289-2302

The k nearest neighbors (k-NN) classifier is one of the most popular methods for statistical pattern recognition and machine learning. In practice, the size k, the number of neighbors used for classification, is usually arbitrarily set to one or some other small numbers, or based on the cross-validation procedure. In this study, we propose a novel alternative approach to decide the size k. Based on a k-NN-based multivariate multi-sample test, we assign each k a permutation test based Z-score. The number of NN is set to the k with the highest Z-score. This approach is computationally efficient since we have derived the formulas for the mean and variance of the test statistic under permutation distribution for multiple sample groups. Several simulation and real-world data sets are analyzed to investigate the performance of our approach. The usefulness of our approach is demonstrated through the evaluation of prediction accuracies using Z-score as a criterion to select the size k. We also compare our approach to the widely used cross-validation approaches. The results show that the size k selected by our approach yields high prediction accuracies when informative features are used for classification, whereas the cross-validation approach may fail in some cases. 相似文献