Similar Literature
20 similar documents found; search time: 109 ms
1.
The K-means clustering method is a widely adopted clustering algorithm in data mining and pattern recognition, where the partitions are made by minimizing the total within-group sum of squares over a given set of variables. Weighted K-means clustering extends the K-means method by assigning nonnegative weights to the variables. In this paper, we aim to obtain more meaningful and interpretable clusters by deriving the optimal variable weights for weighted K-means clustering. Specifically, we improve the weighted K-means clustering method by introducing a new algorithm that obtains the globally optimal variable weights based on the Karush-Kuhn-Tucker conditions. We present the mathematical formulation of the clustering problem, derive the structural properties of the optimal weights, and implement a recursive algorithm to calculate them. Numerical examples on simulated and real data indicate that our method is superior in both clustering accuracy and computational efficiency.
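A minimal sketch of the weighted K-means idea described above: each variable carries a nonnegative weight, and points are assigned to the centroid that minimizes the weighted squared distance. The KKT-based optimization of the weights themselves is not reproduced here; the weight vector `w` is simply taken as given.

```python
def weighted_sq_dist(x, c, w):
    """Weighted squared Euclidean distance between point x and centroid c."""
    return sum(wj * (xj - cj) ** 2 for xj, cj, wj in zip(x, c, w))

def assign(points, centroids, w):
    """Assign each point to its nearest centroid under the weighted metric."""
    labels = []
    for x in points:
        dists = [weighted_sq_dist(x, c, w) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

# Toy data: two groups separated only on the first variable; a large
# weight on variable 1 makes that separation dominate the metric.
points = [(0.0, 5.0), (0.2, -5.0), (10.0, 5.1), (10.2, -4.9)]
w = (1.0, 0.01)                      # downweight the noisy second variable
centroids = [points[0], points[2]]   # simple seeding for the sketch
labels = assign(points, centroids, w)
```

With unit weights the noisy second variable would dominate and split the groups incorrectly; the small weight on it recovers the intended partition.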

2.
k-POD: A Method for k-Means Clustering of Missing Data
The k-means algorithm is often used in clustering applications, but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing-data problem to a complete-data formulation through either deletion or imputation, but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]
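The k-POD idea above can be sketched as alternating between (a) filling each missing entry with the matching coordinate of the point's current centroid and (b) an ordinary k-means pass on the completed data. This is an illustrative simplification with naive seeding, not the authors' implementation.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kpod_sketch(points, seeds, iters=10):
    """points: tuples with None marking missing cells; seeds: indices of
    initial centroid points (fully observed points work best)."""
    dim = len(points[0])
    # Initial completion: fill missing values with per-column means.
    col_means = []
    for d in range(dim):
        obs = [p[d] for p in points if p[d] is not None]
        col_means.append(sum(obs) / len(obs))
    filled = [[p[d] if p[d] is not None else col_means[d] for d in range(dim)]
              for p in points]
    centroids = [filled[i][:] for i in seeds]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j]))
                  for x in filled]
        # k-means centroid update on the completed data
        for j in range(k):
            members = [x for x, l in zip(filled, labels) if l == j]
            if members:
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
        # re-complete: overwrite only the originally missing cells
        for i, p in enumerate(points):
            for d in range(dim):
                if p[d] is None:
                    filled[i][d] = centroids[labels[i]][d]
    return labels

data = [(0.0, 0.0), (0.2, None), (5.0, 5.0), (None, 5.2)]
labels = kpod_sketch(data, seeds=[0, 2])
```

The imputed cells track the centroids of their assigned clusters, so no external imputation model is needed, which mirrors the claim that the method works when the missingness mechanism is unknown.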

3.
Silhouette information evaluates the quality of the partition detected by a clustering technique. Since it is based on a measure of distance between the clustered observations, its standard formulation is not adequate when a density-based clustering technique is used. In this work we propose a suitable modification of the Silhouette information aimed at evaluating the quality of clusters in a density-based framework. It is based on estimating the posterior probabilities that the data belong to the clusters, and it may be used both to measure our confidence in the allocation of data to clusters and to choose the best among competing partitions.
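A sketch of the posterior-probability idea above: for each observation, compare the posterior probability of its best cluster with that of the runner-up on the log scale, and normalize by the largest such value. The exact normalization used by the authors may differ; this is only the general shape of a density-based silhouette.

```python
import math

def density_based_silhouette(posteriors):
    """posteriors: list of per-observation cluster probability lists.
    Returns a confidence score per observation, scaled to at most 1."""
    raw = []
    for p in posteriors:
        s = sorted(p, reverse=True)
        best, second = s[0], s[1]
        raw.append(math.log(best / second))   # 0 when allocation is ambiguous
    top = max(raw) or 1.0                     # guard against all-zero scores
    return [r / top for r in raw]

post = [[0.95, 0.05], [0.55, 0.45], [0.05, 0.95]]
scores = density_based_silhouette(post)
```

Observations allocated with near-certainty score close to 1, while a 55/45 split scores near 0, flagging an unreliable allocation.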

4.
Constrained optimization is proposed as a practical solution to the problem of estimating a distribution function at each point in a given set from monotone sequences of upper and lower bounds. The proposed solution employs least absolute value estimation and, hence, has a linear programming formulation. The special structure inherent in this formulation is exploited and an efficient computational method is discussed. The procedure is illustrated by two examples.
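The paper casts the estimation as a least-absolute-value problem with a linear programming solution. As an illustration only, the greedy pass below produces *a* monotone estimate lying within given monotone bound sequences; it is a feasibility sketch under the assumption that the bounds are themselves monotone and consistent, not the LP optimum the abstract describes.

```python
def monotone_feasible(est, lower, upper):
    """Clip preliminary estimates into [lower, upper] while enforcing a
    nondecreasing sequence (bounds assumed monotone and consistent)."""
    out = []
    prev = 0.0
    for e, lo, hi in zip(est, lower, upper):
        v = min(hi, max(lo, e, prev))   # respect bounds and monotonicity
        out.append(v)
        prev = v
    return out

est = [0.1, 0.05, 0.6, 0.5, 0.9]        # non-monotone raw estimates
lower = [0.0, 0.1, 0.2, 0.4, 0.7]
upper = [0.3, 0.4, 0.7, 0.8, 1.0]
F = monotone_feasible(est, lower, upper)
```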

5.
Summary. We present a decision theoretic formulation of product partition models (PPMs) that allows a formal treatment of different decision problems such as estimation or hypothesis testing and clustering methods simultaneously. A key observation in our construction is the fact that PPMs can be formulated in the context of model selection. The underlying partition structure in these models is closely related to that arising in connection with Dirichlet processes. This allows a straightforward adaptation of some computational strategies—originally devised for nonparametric Bayesian problems—to our framework. The resulting algorithms are more flexible than other competing alternatives that are used for problems involving PPMs. We propose an algorithm that yields Bayes estimates of the quantities of interest and the groups of experimental units. We explore the application of our methods to the detection of outliers in normal and Student t regression models, with clustering structure equivalent to that induced by a Dirichlet process prior. We also discuss the sensitivity of the results considering different prior distributions for the partitions.  相似文献   

6.
Although Hartigan (1975) had already put forward the idea of connecting the identification of subpopulations with high-density regions of the underlying probability distribution, the actual development of methods for cluster analysis has largely shifted towards other directions, for computational convenience. Current computational resources allow us to reconsider this formulation and to develop clustering techniques aimed directly at identifying local modes of the density. Given a set of observations, a nonparametric estimate of the underlying density function is constructed, and subsets of points with high density are formed through suitable manipulation of the associated Delaunay triangulation. The method is illustrated with some numerical examples.
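A one-dimensional sketch of the mode-based idea above: estimate the density with a Gaussian kernel, then move each observation uphill in small steps until it reaches a local maximum; points that share a mode share a cluster. The Delaunay-based machinery in the abstract handles the multivariate case, which this sketch does not attempt.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / (
        len(data) * h * math.sqrt(2 * math.pi))

def climb_to_mode(x, data, h, step=0.01, max_iter=2000):
    """Greedy hill climb on the estimated density, starting from x."""
    for _ in range(max_iter):
        here = kde(x, data, h)
        if kde(x + step, data, h) > here:
            x += step
        elif kde(x - step, data, h) > here:
            x -= step
        else:
            break
    return round(x, 1)   # coarse rounding merges nearby mode locations

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
modes = [climb_to_mode(x, data, h=0.5) for x in data]
clusters = {m: i for i, m in enumerate(dict.fromkeys(modes))}
labels = [clusters[m] for m in modes]
```

The number of clusters is not specified in advance; it falls out of the number of modes the density estimate exhibits, which is the appeal of this formulation.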

7.
The study of spatial variations in disease rates is a common epidemiological approach used to describe the geographical clustering of diseases and to generate hypotheses about the possible 'causes' which could explain apparent differences in risk. Recent statistical and computational developments have led to the use of realistically complex models to account for overdispersion and spatial correlation. However, these developments have focused almost exclusively on spatial modelling of a single disease. Many diseases share common risk factors (smoking being an obvious example) and, if similar patterns of geographical variation of related diseases can be identified, this may provide more convincing evidence of real clustering in the underlying risk surface. We propose a shared component model for the joint spatial analysis of two diseases. The key idea is to separate the underlying risk surface for each disease into a shared and a disease-specific component. The various components of this formulation are modelled simultaneously by using spatial cluster models implemented via reversible jump Markov chain Monte Carlo methods. We illustrate the methodology through an analysis of oral and oesophageal cancer mortality in the 544 districts of Germany, 1986–1990.

8.
高海燕 et al. 《统计研究》, 2020, 37(8): 91-103
A functional clustering algorithm involves two basic elements: projection and clustering. In general, the optimal projection does not necessarily preserve class information, which can degrade the subsequent clustering. This paper reviews the components and workflow of functional clustering; exploiting the clustering property of nonnegative matrix factorization (NMF), it proposes an NMF-based functional clustering algorithm, builds a framework in which projection and clustering are carried out in parallel, solves it by alternating iterative updates, and analyzes the computational time complexity of the algorithm. Validation on simulated data and a speech-recognition data set shows that the proposed functional clustering algorithm improves clustering performance; an application to hourly nitrogen dioxide (NO2) concentration data in Beijing shows that the algorithm distinguishes types of air-quality monitoring stations and recovers the spatial pattern of the station layout, demonstrating good practical value.
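An illustrative sketch of NMF used for clustering, in the spirit of the "projection and clustering in parallel" idea above: factor the nonnegative data matrix V (samples x features) as V ≈ W H with standard multiplicative updates, then read each sample's cluster as the argmax over its row of W. The initialization, k, and iteration count are illustrative choices, not the authors' algorithm.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf_cluster(V, k, iters=500, eps=1e-9, seed=0):
    """Multiplicative-update NMF; cluster label = argmax of each W row."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
    labels = [row.index(max(row)) for row in W]
    return W, H, labels

# Two blocks of samples supported on disjoint feature sets.
V = [[2, 4, 0, 0], [1, 2, 0, 0], [0, 0, 3, 3], [0, 0, 6, 6]]
W, H, labels = nmf_cluster(V, k=2)
```

The multiplicative updates preserve nonnegativity, which is what makes the W rows interpretable as soft cluster memberships.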

9.
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix and is well suited to binary clustering problems. When applied to multi-way clustering, either binary spectral clustering is applied recursively, or the data are embedded in spectral space and another method, such as K-means clustering, is used to cluster the points. Here we propose and study a K-way clustering algorithm, spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation with a diagonal modular structure. The method first transforms the original similarity matrix into a new one that is nearly disconnected and clearly reveals the cluster structure; we then apply a linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using a divide-and-conquer method. To obtain the overall clustering results, we use the cluster assignment from the previous step to initialize a multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering with other initializations.
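A sketch of the binary spectral step that the abstract builds on: form the unnormalized graph Laplacian L = D - S from a similarity matrix S, find the eigenvector of L with the second-smallest eigenvalue (the Fiedler vector), and split the nodes by its sign. The eigenvector is obtained here by power iteration on (c·I - L) while projecting out the constant vector; the modular-transformation algorithm of the abstract is not reproduced.

```python
def fiedler_split(S, iters=500):
    """Binary spectral bisection of a symmetric similarity matrix S."""
    n = len(S)
    deg = [sum(row) for row in S]
    L = [[(deg[i] if i == j else 0.0) - S[i][j] for j in range(n)]
         for i in range(n)]
    c = 2 * max(deg)                     # shift so c*I - L has positive spectrum
    v = [(-1.0) ** i for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        # w = (c*I - L) v
        w = [c * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        mean = sum(w) / n                # project out the all-ones eigenvector
        w = [x - mean for x in w]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [0 if x < 0 else 1 for x in v]

# Two dense blocks with one weak bridge between them.
S = [[0, 5, 5, 0.1, 0, 0],
     [5, 0, 5, 0, 0, 0],
     [5, 5, 0, 0, 0, 0],
     [0.1, 0, 0, 0, 5, 5],
     [0, 0, 0, 5, 0, 5],
     [0, 0, 0, 5, 5, 0]]
labels = fiedler_split(S)
```

For K-way clustering this binary split would be applied recursively, or the leading eigenvectors would be fed to another clusterer, as the abstract notes.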

10.
Reduced k-means clustering is a method for clustering objects in a low-dimensional subspace. Its advantage is that the clustering of objects and the low-dimensional subspace reflecting the cluster structure are obtained simultaneously. In this paper, the relationship between conventional k-means clustering and reduced k-means clustering is discussed. Conditions ensuring almost sure convergence of the reduced k-means estimator as the sample size increases without bound are presented. Results for a more general model encompassing both conventional and reduced k-means clustering are provided, and a consistent selection of the numbers of clusters and dimensions is described.

11.
Density-based clustering methods hinge on the idea of associating groups with the connected components of the level sets of the density underlying the data, to be estimated by a nonparametric method. These methods enjoy some desirable properties and generally good performance, but they involve a non-trivial computational effort, required for the identification of the connected regions. In a previous work, the use of a spatial tessellation such as the Delaunay triangulation was proposed, because it suitably generalizes the univariate procedure for detecting the connected components. However, its computational complexity grows exponentially with the dimensionality of the data, making the triangulation unfeasible for high dimensions. Our aim is to overcome the limitations of the Delaunay triangulation. We discuss the use of an alternative procedure for identifying the connected regions associated with the level sets of the density. By measuring the extent of possible valleys of the density along the segment connecting pairs of observations, the proposed procedure shifts the formulation from a space of arbitrary dimension to a univariate one, yielding benefits in both computation and visualization.

12.
Multi-level models can be used to account for clustering in data from multi-stage surveys. In some cases, the intraclass correlation may be close to zero, so that it may seem reasonable to ignore clustering and fit a single-level model. This article proposes several adaptive strategies for allowing for clustering in regression analysis of multi-stage survey data. The approach is based on testing whether the PSU-level variance component is zero. If this hypothesis is retained, then variance estimates are calculated ignoring clustering; otherwise, clustering is reflected in variance estimation. A simple simulation study is used to evaluate the various procedures.
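The adaptive idea above can be sketched as a two-step rule: compute a one-way ANOVA F statistic for cluster (PSU) effects, and account for clustering in variance estimation only if the statistic exceeds a critical value. The critical value is supplied by the user here (to avoid a distribution-function dependency), and the paper's actual testing and variance-estimation procedures are not reproduced.

```python
def anova_f(groups):
    """F statistic for H0: no between-cluster variance."""
    all_y = [y for g in groups for y in g]
    n, k = len(all_y), len(groups)
    grand = sum(all_y) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def choose_strategy(groups, f_crit):
    """Adaptive rule: robust-to-clustering variances only when needed."""
    return "cluster-robust" if anova_f(groups) > f_crit else "ignore-clustering"

# Three clusters of three observations each; 5.14 is roughly the 5%
# critical value of F(2, 6).
clustered = [[10.0, 10.2, 9.9], [0.1, 0.3, -0.2], [5.0, 5.1, 4.8]]
homogeneous = [[1.0, 2.0, 0.0], [1.5, 0.5, 1.2], [0.8, 1.9, 0.6]]
```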

13.
An Exploration of Clustering Methods for Functional Data
Functional data are a relatively new data type in data analysis, combining features of time series and cross-sectional data; they can usually be described as the functional trace of some variable and are highly useful in practice. This paper first reviews the basic characteristics of functional data and existing functional clustering methods, such as uniformly corrected functional K-means clustering and functional hierarchical clustering. On this basis, it examines functional clustering from the perspective of functional feature analysis and proposes an interval clustering method for functional data based on derivative analysis. The method is applied to employment data for six provinces in central China, yielding clustering results.
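A sketch of the derivative-based idea above: curves observed on a common grid are summarized by their finite-difference first derivatives, and curves are grouped by the distance between these derivative profiles. The interval-clustering refinements of the paper are not reproduced; nearest-seed assignment stands in for the full clustering step.

```python
def derivative(curve, dt=1.0):
    """First-order finite-difference derivative of a sampled curve."""
    return [(b - a) / dt for a, b in zip(curve, curve[1:])]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cluster_by_derivative(curves, seed_idx):
    """Assign each curve to the seed whose derivative profile is nearest."""
    derivs = [derivative(c) for c in curves]
    seeds = [derivs[i] for i in seed_idx]
    return [min(range(len(seeds)), key=lambda j: sq_dist(d, seeds[j]))
            for d in derivs]

# Rising curves vs. falling curves whose levels overlap.
curves = [[0, 1, 2, 3], [5, 6, 7, 8],      # increasing trend
          [8, 6, 4, 2], [3, 2, 1, 0]]      # decreasing trend
labels = cluster_by_derivative(curves, seed_idx=[0, 2])
```

Clustering on raw levels would mix these groups (the rising curve starting at 0 and the falling curve ending at 0 occupy the same range); the derivative features separate them by shape, which is the point of the feature-based view.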

14.
Cluster analysis is a popular technique in statistics and computer science, commonly used in various areas of research. In this article, we investigate factors that can influence clustering performance in the model-based clustering framework. The four factors considered are the level of overlap, the number of clusters, the number of dimensions, and the sample size. Through a comprehensive simulation study, we investigate model-based clustering in different settings. As a measure of clustering performance, we employ three popular classification indices capable of reflecting the degree of agreement between two partitioning vectors, making a comparison between the true and estimated classification vectors possible. In addition to studying clustering complexity, the performance of the three classification measures is evaluated.

15.
In research on clustering methods for panel data, which have both a cross-sectional and a time dimension, this paper modifies the Euclidean distance function to incorporate indicator weights and time weights into the clustering process, proposing a "weighted distance function" suited to panel-data clustering together with a corresponding Ward.D clustering method. We first define a Euclidean distance function that accounts for indicator levels, growth rates between adjacent time points, and the degree of variability; indicator weights and time weights are then combined into a composite weighted distance through a linear model, yielding a weighted clustering procedure for panel data. Empirical results show that this weighted panel-data clustering method, accounting for indicator and time weights, discriminates better and improves the accuracy of sample clustering.
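A sketch of the composite distance described above: for two panel units observed on the same indicators, combine squared differences in levels, in adjacent-period growth rates, and in variability, using user-supplied indicator weights and time weights. The exact functional form and the Ward.D step in the paper are not reproduced; the component weights `comp_w` are an illustrative stand-in for the linear aggregation model.

```python
def growth(series):
    """Adjacent-period growth rates (series assumed positive)."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

def variation(series):
    """Population standard deviation as a variability summary."""
    m = sum(series) / len(series)
    return (sum((x - m) ** 2 for x in series) / len(series)) ** 0.5

def weighted_panel_dist(x, y, ind_w, time_w, comp_w=(1.0, 1.0, 1.0)):
    """x, y: dicts mapping indicator name -> list of values over time."""
    level = sum(ind_w[k] * sum(tw * (a - b) ** 2
                               for tw, a, b in zip(time_w, x[k], y[k]))
                for k in x)
    grow = sum(ind_w[k] * sum((a - b) ** 2
                              for a, b in zip(growth(x[k]), growth(y[k])))
               for k in x)
    var = sum(ind_w[k] * (variation(x[k]) - variation(y[k])) ** 2 for k in x)
    wl, wg, wv = comp_w
    return (wl * level + wg * grow + wv * var) ** 0.5

a = {"gdp": [100, 110, 121]}
b = {"gdp": [100, 110, 121]}
c = {"gdp": [200, 190, 180]}
ind_w = {"gdp": 1.0}
time_w = [0.2, 0.3, 0.5]
```

Such a distance matrix can then be passed to any hierarchical procedure, which is how the Ward.D step in the paper would consume it.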

16.
Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, nonperiodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters was misspecified. To illustrate the results, a real set of functional data was clustered for which the true clustering structure is believed to be known. Comparing the clustering methods on the real data set confirmed the findings of the simulation. This study yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
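The Rand index used above is the proportion of object pairs on which two partitions agree: either both place the pair in the same cluster, or both place it in different clusters. A direct pair-counting sketch:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Proportion of object pairs on which the two partitions agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

truth = [0, 0, 1, 1]
est_good = [1, 1, 0, 0]      # same partition, labels permuted
est_bad = [0, 1, 0, 1]
```

Because only co-membership matters, the index is invariant to relabeling of the clusters, which is why it suits comparing an estimated partition against the truth.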

17.
We propose an intuitive and computationally simple algorithm for clustering probability density functions (pdfs). A data-driven learning mechanism is incorporated into the algorithm to determine suitable cluster widths. The clustering results show that the proposed algorithm is able to group the pdfs automatically and provide the optimal number of clusters without any a priori information. The performance study also shows that the proposed algorithm is more efficient than existing ones. In addition, the clustering can serve as an intermediate compression tool in content-based multimedia retrieval; we apply the proposed algorithm to categorize a subset of the COREL image database, and the clustering results indicate that it performs well in colour image categorization.

18.
A nonparametric test for the presence of clustering in survival data is proposed. Assuming a model that incorporates the clustering effect into the Cox proportional hazards model, simulation studies indicate that the procedure is correctly sized and powerful in a reasonably wide range of scenarios. The test for the presence of clustering over time is also robust to model misspecification. With a large number of clusters, the test is powerful even when the data are highly heterogeneous.

19.
This paper considers adaptations of hierarchical models for small-area disease counts to detect disease clustering. A high-risk area may be an outlier (in local terms) if surrounded by low-risk areas, whereas a high-risk cluster requires that both the focus area and its surrounding areas demonstrate common elevated risk. A local join count method is suggested for detecting local clustering of high disease risk in a single health outcome, and it extends to assessing bivariate spatial clustering in relative risk. Applications include assessing spatial heterogeneity in the effects of area predictors according to the local clustering configuration, and gauging the sensitivity of bivariate clustering to random-effect assumptions.
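A sketch of the local join count idea above: with a binary high-risk indicator per area and an adjacency list, the local join count of area i is the number of its neighbours that are also high risk (counted only when i itself is high risk). A high count flags a focus area embedded in a similarly elevated neighbourhood, while a high-risk area with count zero is a local outlier, matching the distinction drawn in the abstract. The hierarchical modelling in the paper is not reproduced.

```python
def local_join_counts(high, adj):
    """high: list of 0/1 risk indicators; adj: dict node -> neighbours."""
    return [high[i] * sum(high[j] for j in adj[i]) for i in range(len(high))]

# Toy map: areas 0-2 form a mutually adjacent high-risk pocket; area 3 is
# a high-risk outlier whose neighbours 4 and 5 are low risk.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
high = [1, 1, 1, 1, 0, 0]
counts = local_join_counts(high, adj)
```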

20.
Mixtures of factor analyzers is a useful model-based clustering method that can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive both to diverse non-normalities of the marginal variables and to outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for clustering high-dimensional data. This model has two advantages: (1) it allows different marginal distributions, giving the mixture model greater fitting flexibility; and (2) it avoids the curse of dimensionality by embedding the factor-analytic structure in the component-correlation matrices of the mixture distribution. An EM algorithm is developed for fitting MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution that best fits the data. It is applied to both synthetic data and a microarray gene expression data set for clustering, and it shows better performance than several existing methods.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号