
1.

Impact of Contamination on Training and Test Error Rates in Statistical Clustering

C. Ruwet G. Haesbroeck 《Communications in Statistics - Simulation and Computation》2013,42(3):394-411

The k-means algorithm is one of the most common nonhierarchical methods of clustering. It aims to construct clusters in order to minimize the within-cluster sum of squared distances. However, like most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, the focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates.
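As a rough illustration of the non-robustness described in this abstract, the following minimal sketch compares how a single atypical observation moves a cluster mean versus a cluster medoid. It is not the authors' procedure: the helper names `within_cluster_ss` and `medoid` are hypothetical, the data are one-dimensional toy values, and the medoid is assumed to be the cluster member minimizing the sum of absolute distances.

```python
from statistics import mean

def within_cluster_ss(points, centers, labels):
    """Sum of squared distances from each 1-D point to its assigned center."""
    return sum((x - centers[k]) ** 2 for x, k in zip(points, labels))

def medoid(cluster):
    """Cluster member minimizing the sum of absolute distances to all members.

    Assumption for illustration: absolute distance as the dissimilarity.
    Ties are broken in favor of the first member encountered.
    """
    return min(cluster, key=lambda c: sum(abs(x - c) for x in cluster))

clean = [1.0, 2.0, 3.0]
contaminated = clean + [100.0]  # one atypical observation added

# Within-cluster sum of squares of the clean data around its mean
wss_clean = within_cluster_ss(clean, [mean(clean)], [0, 0, 0])

# How far each center moves once the outlier is included
mean_shift = abs(mean(contaminated) - mean(clean))
medoid_shift = abs(medoid(contaminated) - medoid(clean))

print(wss_clean, mean_shift, medoid_shift)
```

In this toy setting the mean is dragged toward the outlier while the medoid stays on an original data point, which is the intuition behind preferring k-medoids when contamination is expected.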

2.

A Pitman measure of similarity in k-means for clustering heavy-tailed data

Arman Reybod Javad Etminan Adel Mohammadpour 《Communications in Statistics - Simulation and Computation》2019,48(6):1595-1605

One of the most popular algorithms for partitioning data into k clusters is the k-means clustering algorithm. Since this method relies on basic conditions such as the existence of a mean and a finite variance, it is unsuitable for data with infinite variance, such as data from heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, using PMC and building on k-means clustering, a new distance and clustering algorithm are developed for heavy-tailed data.

3.

Sequential imputation for models with latent variables assuming latent ignorability

Lauren J. Beesley Jeremy M. G. Taylor Roderick J. A. Little 《Australian & New Zealand Journal of Statistics》2019,61(2):213-233

Models that involve an outcome variable, covariates, and latent variables are frequently the target for estimation and inference. The presence of missing covariate or outcome data presents a challenge, particularly when missingness depends on the latent variables. This missingness mechanism is called latent ignorable or latent missing at random and is a generalisation of missing at random. Several authors have previously proposed approaches for handling latent ignorable missingness, but these methods rely on prior specification of the joint distribution for the complete data. In practice, specifying the joint distribution can be difficult and/or restrictive. We develop a novel sequential imputation procedure for imputing covariate and outcome data for models with latent variables under latent ignorable missingness. The proposed method does not require a joint model; rather, we use results under a joint model to inform imputation with less restrictive modelling assumptions. We discuss identifiability and convergence‐related issues, and simulation results are presented in several modelling settings. The method is motivated and illustrated by a study of head and neck cancer recurrence. Imputing missing data for models with latent variables under latent‐dependent missingness without specifying a full joint model.

4.

Snipping for robust k-means clustering under component-wise contamination

Alessio Farcomeni 《Statistics and Computing》2014,24(6):907-919

We introduce the concept of snipping, complementing that of trimming, in robust cluster analysis. An observation is snipped when some of its dimensions are discarded, but the remaining are used for clustering and estimation. Snipped k-means is performed through a probabilistic optimization algorithm which is guaranteed to converge to the global optimum. We show global robustness properties of our snipped k-means procedure. Simulations and a real data application to optical recognition of handwritten digits are used to illustrate and compare the approach.

5.

k-Means Algorithm in Statistical Shape Analysis

Getulio J. A. Amaral Luiz H. Dore Rosangela P. Lessa Borko Stosic 《Communications in Statistics - Simulation and Computation》2013,42(5):1016-1026

In this work it is shown how the k-means method for clustering objects can be applied in the context of statistical shape analysis. Because the choice of a suitable distance measure is a key issue for shape analysis, the Hartigan and Wong k-means algorithm is adapted for this situation. Simulations on controlled artificial data sets demonstrate that distances on the pre-shape spaces are more appropriate than the Euclidean distance on the tangent space. Finally, results are presented for an application to a real problem in oceanography, which in fact motivated the current work.

6.

A novel fast heuristic to handle large-scale shape clustering

《Journal of Statistical Computation and Simulation》2012,82(1):160-169

Clustering algorithms such as k-means variants are fast, but they are inefficient for shape clustering. Some algorithms are effective, but their time complexities are too high. This paper proposes a novel heuristic to solve large-scale shape clustering. The proposed method is effective, and it solves large-scale clustering problems in a fraction of a second.

7.

Estimating household structure in ancient China by using historical data: a latent class analysis of partially missing patterns

Tim Futing Liao 《Journal of the Royal Statistical Society. Series A, (Statistics in Society)》2004,167(1):125-139

Summary. Social data often contain missing information. The problem is inevitably severe when analysing historical data. Conventionally, researchers analyse complete records only. Listwise deletion not only reduces the effective sample size but also may result in biased estimation, depending on the missingness mechanism. We analyse household types by using population registers from ancient China (618–907 AD) by comparing a simple classification, a latent class model of the complete data and a latent class model of the complete and partially missing data assuming four types of ignorable and non-ignorable missingness mechanisms. The findings show that either a frequency classification or a latent class analysis using the complete records only yielded biased estimates and incorrect conclusions in the presence of partially missing data of a non-ignorable mechanism. Although simply assuming ignorable or non-ignorable missing data produced consistently similarly higher estimates of the proportion of complex households, a specification of the relationship between the latent variable and the degree of missingness by a row effect uniform association model helped to capture the missingness mechanism better and improved the model fit.

8.

K-medoids inverse regression

Michael J. Brusco Douglas Steinley Jordan Stevens 《Communications in Statistics - Theory and Methods》2013,42(20):4999-5011

K-means inverse regression was developed as an easy-to-use dimension reduction procedure for multivariate regression. This approach is similar to the original sliced inverse regression method, with the exception that the slices are explicitly produced by a K-means clustering of the response vectors. In this article, we propose K-medoids clustering as an alternative clustering approach for slicing and compare its performance to K-means in a simulation study. Although the two methods often produce comparable results, K-medoids tends to yield better performance in the presence of outliers. In addition to isolation of outliers, K-medoids clustering also has the advantage of accommodating a broader range of dissimilarity measures, which could prove useful in other graphical regression applications where slicing is required.

9.

Bayesian semiparametric models for nonignorable missing mechanisms in generalized linear models

Z. I. Kalaylioglu O. Ozturk 《Journal of applied statistics》2013,40(8):1746-1763

Semiparametric models provide a more flexible form for modeling the relationship between the response and the explanatory variables. In the literature on modeling missing variables, on the other hand, the canonical form of the probability of the variable being missing (p) is modeled with a fully parametric approach. Here we consider a regression spline based semiparametric approach to model the missingness mechanism of nonignorably missing covariates. In this model the relationship between the suitable canonical form of p (e.g. probit p) and the missing covariate is modeled through several splines. A Bayesian procedure is developed to efficiently estimate the parameters. A computationally advantageous prior construction is proposed for the parameters of the semiparametric part. A WinBUGS code is constructed to apply Gibbs sampling to obtain the posterior distributions. We show through an extensive Monte Carlo simulation experiment that response model coefficient estimators maintain better (when the true missingness mechanism is nonlinear) or equivalent (when the true missingness mechanism is linear) bias and efficiency properties with the use of the proposed semiparametric missingness model compared to the conventional model.

10.

How to Make Model‐free Feature Screening Approaches for Full Data Applicable to the Case of Missing Response?

《Scandinavian Journal of Statistics》2018,45(2):324-346

It is quite a challenge to develop model‐free feature screening approaches for missing response problems because the existing standard missing data analysis methods cannot be applied directly to the high‐dimensional case. This paper develops some novel methods by borrowing information from missingness indicators such that any feature screening procedure for ultrahigh‐dimensional covariates with full data can be applied to the missing response case. The first method is the so‐called missing indicator imputation screening, which is developed by proving that the set of the active predictors of interest for the response is a subset of the active predictors for the product of the response and missingness indicator under some mild conditions. As an alternative, another method called the Venn diagram‐based approach is also developed. The sure screening property is proven for both methods. It is shown that the complete case analysis can also keep the sure screening property of any feature screening approach with the sure screening property.

11.

Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

Dongmeng Liu Jinko Graham 《The American statistician》2019,73(1):70-79

We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.
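For orientation, the classic silhouette width that this abstract uses as its point of departure can be sketched as follows. This shows only the standard definition s(i) = (b − a) / max(a, b), not the probability-like measures the authors propose; the function name `silhouette_widths` and the toy data are hypothetical, and the sketch assumes at least two clusters and no cluster of coincident points.

```python
def silhouette_widths(dist, labels):
    """Classic silhouette widths from a pairwise dissimilarity matrix.

    dist   : square list-of-lists of pairwise dissimilarities
    labels : list with the hard cluster label of each observation
    Assumes at least two distinct clusters are present.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            # Singleton cluster: the silhouette is conventionally set to 0
            widths.append(0.0)
            continue
        # a: mean dissimilarity to the other members of i's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths

# Toy example: four points on a line, split into two well-separated clusters
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]
labels = [0, 0, 1, 1]
widths = silhouette_widths(dist, labels)
print(widths)
```

Values near 1 indicate an observation sitting well inside its cluster, while values near 0 flag the ambiguous memberships that the proposed measures are meant to quantify on a probability-like scale.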

12.

Comparison of clustering algorithms on generalized propensity score in observational studies: a simulation study

Chunhao Tu Shuo Jiao Woon Yuen Koh 《Journal of Statistical Computation and Simulation》2013,83(12):2206-2218

In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences on the estimation of treatment effects. Recently, the generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique to apply the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on the transformed GPS. We compare four popular clustering algorithms: k-means clustering (KMC), model-based clustering, fuzzy c-means clustering and partitioning around medoids, based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index and average silhouette width, under four different covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance compared with the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS.

13.

Mixtures of general location model with factor analyzer covariance structure for clustering mixed type data

Leila Amiri Mojtaba Ganjali 《Journal of applied statistics》2019,46(11):2075-2100

Cluster analysis is one of the most widely used methods in statistical analyses, in which homogeneous subgroups are identified in a heterogeneous population. Due to the existence of mixed continuous and discrete data in many applications, some ordinary clustering methods, such as hierarchical methods, k-means and model-based methods, have so far been extended for the analysis of mixed data. However, in the available model-based clustering methods, as the number of continuous variables increases, the number of parameters increases, and identifying as well as fitting an appropriate model may be difficult. In this paper, to reduce the number of parameters, a set of parsimonious models is introduced for the model-based clustering of mixed continuous (normal) and nominal data. Models in this set are extended, using the general location model approach, for modeling the distribution of mixed variables and applying a factor analyzer structure for covariance matrices. The ECM algorithm is used for estimating the parameters of these models. In order to show the performance of the proposed models for clustering, results from some simulation studies and from the analysis of two real data sets are presented.

14.

Classification performance resulting from a 2-means

C. Ruwet G. Haesbroeck 《Journal of statistical planning and inference》2013

The k-means procedure is probably one of the most common nonhierarchical clustering techniques. From a theoretical point of view, it is related to the search for the k principal points of the underlying distribution. In this paper, the classification resulting from that procedure for k=2 is shown to be optimal under a balanced mixture of two spherically symmetric and homoscedastic distributions. Then, the classification efficiency of the 2-means rule is assessed using the second order influence function and compared to the classification efficiencies of Fisher and Logistic discriminations. Influence functions are also considered here to compare the robustness to infinitesimal contamination of the 2-means method w.r.t. the generalized 2-means technique.

15.

Representative points for location-biased datasets

Zong-Feng Qi Kai-Tai Fang 《Communications in Statistics - Simulation and Computation》2019,48(2):458-471

Representative points (RPs) are a set of points that optimally represents a distribution in terms of mean square error. When the prior data is location biased, direct methods such as the k-means algorithm may be inefficient at obtaining the RPs. In this article, a new indirect algorithm is proposed to search for the RPs based on location-biased datasets. Such an algorithm does not constrain the parametric model of the true distribution. The empirical study shows that such an algorithm can obtain better RPs than the k-means algorithm.

16.

A Permutation Based Procedure for Classification Assessment

《Communications in Statistics - Theory and Methods》2012,41(16-17):3126-3137

This article proposes a permutation procedure for evaluating the performance of different classification methods. In particular, we focus on two of the most widely used classification methodologies: latent class analysis and k-means clustering. The classification performance is assessed by means of a permutation procedure which allows for a direct comparison of the methods and the development of a statistical test, and which points out better potential solutions. Our proposal provides an innovative framework for the validation of the data partitioning and offers a guide in the choice of which classification procedure should be used.

17.

Weighting variables in K-means clustering

Myung-Hoe Huh 《Journal of applied statistics》2009,36(1):67-78

The aim of this study is to assign weights w_1, …, w_m to m clustering variables Z_1, …, Z_m, so that the k groups uncovered reveal more meaningful within-group coherence. We propose a new criterion to be minimized, which is the sum of the weighted within-cluster sums of squares and the penalty for the heterogeneity in variable weights w_1, …, w_m. We present the computing algorithm for such k-means clustering, a working procedure to determine a suitable value of the penalty constant, and numerical examples, of which one is simulated and the other two are real.

18.

Rounding non-binary categorical variables following multivariate normal imputation: evaluation of simple methods and implications for practice

《Journal of Statistical Computation and Simulation》2012,82(4):798-811

We study bias arising from rounding categorical variables following multivariate normal (MVN) imputation. This task has been well studied for binary variables, but not for more general categorical variables. Three methods that assign imputed values to categories based on fixed reference points are compared using 25 specific scenarios covering variables with k=3, …, 7 categories, and five distributional shapes, and for each k=3, …, 7, we examine the distribution of bias arising over 100,000 distributions drawn from a symmetric Dirichlet distribution. We observed, on both empirical and theoretical grounds, that one method (projected-distance-based rounding) is superior to the other two methods, and that the risk of invalid inference with the best method may be too high at sample sizes n≥150 at 50% missingness, n≥250 at 30% missingness and n≥1500 at 10% missingness. Therefore, these methods are generally unsatisfactory for rounding categorical variables (with up to seven categories) following MVN imputation.

19.

Semiparametric inference for estimating equations with nonignorably missing covariates

Ji Chen Fang Fang 《Journal of nonparametric statistics》2018,30(3):796-812

We consider statistical inference of unknown parameters in estimating equations (EEs) when some covariates have nonignorably missing values, which is quite common in practice but has rarely been discussed in the literature. When an instrument, a fully observed covariate vector that helps identify parameters under nonignorable missingness, is available, the conditional distribution of the missing covariates given other covariates can be estimated by the pseudolikelihood method of Zhao and Shao [(2015), ‘Semiparametric pseudo likelihoods in generalised linear models with nonignorable missing data’, Journal of the American Statistical Association, 110, 1577–1590] and be used to construct unbiased EEs. These modified EEs then constitute a basis for valid inference by empirical likelihood. Our method is applicable to a wide range of EEs used in practice. It is semiparametric since no parametric model for the propensity of missing covariate data is assumed. Asymptotic properties of the proposed estimator and the empirical likelihood ratio test statistic are derived. Some simulation results and a real data analysis are presented for illustration.

20.

Bayesian nonparametric clustering for large data sets

Zuanetti Daiane Aparecida Müller Peter Zhu Yitan Yang Shengjie Ji Yuan 《Statistics and Computing》2019,29(2):203-215

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”