Similar articles
Found 20 similar articles; search took 15 ms
1.
The k-means algorithm is one of the most popular methods for partitioning data into k clusters. Because the method relies on basic conditions such as the existence of a mean and a finite variance, it is unsuitable for data with infinite variance, such as data from heavy-tailed distributions. The Pitman Measure of Closeness (PMC) is a criterion that shows how close an estimator is to its parameter relative to another estimator. In this article, a new distance and a new clustering algorithm for heavy-tailed data are developed using PMC, building on k-means clustering.

2.
The k-means algorithm is one of the most common non-hierarchical clustering methods. It aims to construct clusters that minimize the within-cluster sum of squared distances. However, like most estimators defined through objective functions that depend on global sums of squares, the k-means procedure is not robust to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g., the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this article, the focus is on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are highlighted by comparing the influence functions and breakdown points of these error rates.
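As a rough illustration of the objective and the robustness issue mentioned in this abstract, the sketch below computes the within-cluster sum of squared distances for a fixed partition and compares a cluster mean with a medoid when one atypical observation is present. The helper names and the toy data are hypothetical and are not taken from the article; the medoid here is simply the member with the smallest total Euclidean distance to the other members.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum over clusters of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

def medoid(points):
    """Return the member with the smallest total distance to all members."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

# Hypothetical toy data: two tight groups on a line.
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
wcss = within_cluster_ss(points, labels)

# One atypical observation (100.0) drags the mean much further than the medoid.
contaminated = np.array([[0.0], [1.0], [2.0], [100.0]])
mean_center = contaminated.mean(axis=0)
medoid_center = medoid(contaminated)
```

In this toy case the mean moves well outside the bulk of the data while the medoid stays at one of the original points, which is the kind of sensitivity to atypical observations the abstract refers to.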

3.
Clustering algorithms such as the variants of k-means are fast, but they are inefficient for shape clustering. Some algorithms are effective, but their time complexities are too high. This paper proposes a novel heuristic for large-scale shape clustering. The proposed method is effective and solves large-scale clustering problems in a fraction of a second.

4.
Representative points (RPs) are a set of points that optimally represent a distribution in terms of mean square error. When the prior data are location biased, direct methods such as the k-means algorithm may be inefficient at obtaining the RPs. In this article, a new indirect algorithm is proposed to search for the RPs based on location-biased datasets. This algorithm does not constrain the parameter model of the true distribution. The empirical study shows that the algorithm can obtain better RPs than the k-means algorithm.

5.
k-POD: A Method for k-Means Clustering of Missing Data   Total citations: 1 (self-citations: 0, citations by others: 1)
The k-means algorithm is often used in clustering applications, but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation, but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]

6.
The k-means procedure is probably one of the most common nonhierarchical clustering techniques. From a theoretical point of view, it is related to the search for the k principal points of the underlying distribution. In this paper, the classification resulting from that procedure for k=2 is shown to be optimal under a balanced mixture of two spherically symmetric and homoscedastic distributions. Then, the classification efficiency of the 2-means rule is assessed using the second-order influence function and compared to the classification efficiencies of Fisher and logistic discriminations. Influence functions are also considered here to compare the robustness to infinitesimal contamination of the 2-means method with that of the generalized 2-means technique.

7.
We introduce the concept of snipping, complementing that of trimming, in robust cluster analysis. An observation is snipped when some of its dimensions are discarded, but the remaining ones are used for clustering and estimation. Snipped k-means is performed through a probabilistic optimization algorithm which is guaranteed to converge to the global optimum. We show global robustness properties of our snipped k-means procedure. Simulations and a real data application to optical recognition of handwritten digits are used to illustrate and compare the approach.

8.
We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.

9.
The aim of this study is to assign weights w_1, …, w_m to m clustering variables Z_1, …, Z_m, so that the k groups uncovered reveal more meaningful within-group coherence. We propose a new criterion to be minimized, which is the sum of the weighted within-cluster sums of squares and a penalty for the heterogeneity in the variable weights w_1, …, w_m. We present the computing algorithm for such k-means clustering, a working procedure to determine a suitable value of the penalty constant, and numerical examples, one of which is simulated and the other two real.

10.
Contours may be viewed as the 2D outline of the image of an object. This type of data arises in medical imaging as well as in computer vision, can be modeled as data on a manifold, and can be studied using statistical shape analysis. Practically speaking, each observed contour, while theoretically infinite dimensional, must be discretized for computations. As such, the coordinates for each contour are obtained at k sampling times, resulting in the contour being represented as a k-dimensional complex vector. While choosing large values of k will result in closer approximations to the original contour, this will also result in higher computational costs in the subsequent analysis. The goal of this study is to determine reasonable values for k so as to keep the computational cost low while maintaining accuracy. To do this, we consider two methods for selecting sample points and determine lower bounds for k for obtaining a desired level of approximation error using two different criteria. Because this process is computationally inefficient to perform on a large scale, we then develop models for predicting the lower bounds for k based on simple characteristics of the contours.

11.
Communications in Statistics - Theory and Methods, 2012, 41(16-17): 3126-3137
This article proposes a permutation procedure for evaluating the performance of different classification methods. In particular, we focus on two of the most widely used classification methodologies: latent class analysis and k-means clustering. The classification performance is assessed by means of a permutation procedure which allows for a direct comparison of the methods, the development of a statistical test, and points out better potential solutions. Our proposal provides an innovative framework for the validation of the data partitioning and offers a guide in the choice of which classification procedure should be used.

12.
In observational studies, unbalanced observed covariates between treatment groups often cause biased inferences in the estimation of treatment effects. Recently, the generalized propensity score (GPS) has been proposed to overcome this problem; however, a practical technique to apply the GPS is lacking. This study demonstrates how clustering algorithms can be used to group similar subjects based on the transformed GPS. We compare four popular clustering algorithms: k-means clustering (KMC), model-based clustering, fuzzy c-means clustering and partitioning around medoids, based on the following three criteria: average dissimilarity between subjects within clusters, average Dunn index and average silhouette width, under four different covariate scenarios. Simulation studies show that the KMC algorithm has overall better performance compared with the other three clustering algorithms. Therefore, we recommend using the KMC algorithm to group similar subjects based on the transformed GPS.

13.
Consider k (k ≥ 1) independent Weibull populations and a control population which is also Weibull. The problem of identifying which of these k populations are better than the control, using the shape parameter as a criterion, is considered. We allow the possibility of making at most m (0 ≤ m < k) incorrect identifications of better populations. This allowance results in significant savings in sample size. Procedures based on simple linear unbiased estimators of the reciprocal of the shape parameters of these populations are proposed. These procedures can be used for both complete and Type II-censored samples. A related problem of confidence intervals for the ratio of ordered shape parameters is also considered. Monte Carlo simulations as well as both chi-square and normal approximations to the solutions are obtained.

14.
In the social science disciplines, the assumption that the data stem from a single homogeneous population is often unrealistic in empirical research. When applying a causal modeling approach, such as partial least squares path modeling, segmentation is a key issue in coping with the problem of heterogeneity in the estimated cause–effect relationships. This article uses the novel finite-mixture partial least squares (FIMIX-PLS) method to uncover unobserved heterogeneity in a complex path modeling example in the field of marketing. An evaluation of the results includes a comparison with the outcomes of several data analysis strategies based on a priori information or k-means cluster analysis. The results of this article underpin the effectiveness and the advantageous capabilities of FIMIX-PLS in general PLS path model set-ups by means of empirical data and formative as well as reflective measurement models. Consequently, this research substantiates the general applicability of FIMIX-PLS to path modeling as a standard means of evaluating PLS results by addressing the problem of unobserved heterogeneity.

15.
Euclidean distance k-nearest neighbor (k-NN) classifiers are simple nonparametric classification rules. Bootstrap methods, widely used for estimating the expected prediction error of classification rules, are motivated by the objective of calculating the ideal bootstrap estimate of expected prediction error. In practice, bootstrap methods use Monte Carlo resampling to estimate the ideal bootstrap estimate because exact calculation is generally intractable. In this article, we present analytical formulae for exact calculation of the ideal bootstrap estimate of expected prediction error for k-NN classifiers and propose a new weighted k-NN classifier based on resampling ideas. The resampling-weighted k-NN classifier replaces the k-NN posterior probability estimates by their expectations under resampling and predicts an unclassified covariate as belonging to the group with the largest resampling expectation. A simulation study and an application involving remotely sensed data show that the resampling-weighted k-NN classifier compares favorably to unweighted and distance-weighted k-NN classifiers.

16.
K-means inverse regression was developed as an easy-to-use dimension reduction procedure for multivariate regression. This approach is similar to the original sliced inverse regression method, with the exception that the slices are explicitly produced by a K-means clustering of the response vectors. In this article, we propose K-medoids clustering as an alternative clustering approach for slicing and compare its performance to K-means in a simulation study. Although the two methods often produce comparable results, K-medoids tends to yield better performance in the presence of outliers. In addition to isolation of outliers, K-medoids clustering also has the advantage of accommodating a broader range of dissimilarity measures, which could prove useful in other graphical regression applications where slicing is required.

17.
The reconstruction of phylogenetic trees is one of the most important and interesting problems in evolutionary studies. Many methods have been proposed in the literature for constructing phylogenetic trees. Each approach is based on different criteria and evolutionary models. However, the topologies of trees constructed by different methods may be quite different. The topological errors may be due to unsuitable criteria or evolutionary models. Since there are many tree construction approaches, we are interested in selecting a better tree to fit the true model. In this study, we propose an adjusted k-means approach and a misclassification error score criterion to solve the problem. The simulation study shows that this method can select better trees among the potential candidates, which provides a useful approach to phylogenetic tree selection.

18.
Cylindrical data are bivariate data formed by combining circular and linear variables. However, up to now no work has been done on the detection of outliers in cylindrical data. We introduce a definition of an outlier for cylindrical data and present a new test of discordancy to detect outliers in this type of data, based on the k-nearest neighbor’s distance. Cut-off points of the new test statistic based on the Johnson-Wehrly distribution are calculated and its performance is examined using simulation. A practical example is presented using wind speed and wind direction data obtained from the Malaysian Meteorological Department.

19.
The two-sample problem and its extension to the k-sample problem are well known in the statistical literature. However, the discrete version of the k-sample problem is relatively less explored. In this work we suggest a k-sample non-parametric test procedure for discrete distributions based on mutual information. A detailed power study, including a comparison with other alternatives, is provided. Finally, a comparison of some English soccer league teams based on their goal-scoring patterns is discussed.

20.
We propose the L1 distance between the distribution of a binned data sample and a probability distribution from which it is hypothetically drawn as a statistic for testing agreement between the data and a model. We study the distribution of this distance for N-element samples drawn from k bins of equal probability and derive asymptotic formulae for the mean and dispersion of L1 in the large-N limit. We argue that the L1 distance is asymptotically normally distributed, with the mean and dispersion being accurately reproduced by asymptotic formulae even for moderately large values of N and k.
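The statistic described in this abstract can be sketched as follows. The abstract does not state the normalization, so this minimal example assumes the distance is taken between the empirical bin frequencies n_i/N and the model probabilities p_i; the function name `l1_distance`, the seed, and the values of N and k are hypothetical choices for illustration only.

```python
import numpy as np

def l1_distance(counts, probs):
    """L1 distance between empirical bin frequencies and model probabilities.

    Assumed normalization: sum_i |n_i / N - p_i| (not stated in the abstract).
    """
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    freqs = counts / counts.sum()
    return np.abs(freqs - probs).sum()

# Small deterministic check: frequencies (0.75, 0.25) against (0.5, 0.5).
small = l1_distance([3, 1], [0.5, 0.5])

# N-element sample drawn from k bins of equal probability.
rng = np.random.default_rng(0)
k, n = 10, 1000
probs = np.full(k, 1.0 / k)
sample = rng.integers(0, k, size=n)
counts = np.bincount(sample, minlength=k)
stat = l1_distance(counts, probs)
```

Under this normalization the statistic lies between 0 and 2; repeating the draw many times would give an empirical distribution that could be compared with the asymptotic mean and dispersion the article derives.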
