首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Abstract

Cluster analysis is the distribution of objects into different groups or more precisely the partitioning of a data set into subsets (clusters) so that the data in subsets share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects into different clusters. Solution of such problems for a large number of high dimensional data points is quite complicated and most of the existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data set has been used to cluster the spectra of 702248 galaxies and quasars having 1,540 points in wavelength range imposed by the instrument. The proposed technique has successfully discovered five clusters from this 702,248X1,540 data matrix.  相似文献   

2.
Cluster analysis is an important technique of explorative data mining. It refers to a collection of statistical methods for learning the structure of data by solely exploring pairwise distances or similarities. Often meaningful structures are not detectable in these high-dimensional feature spaces. Relevant features can be obfuscated by noise from irrelevant measurements. These observations led to the design of subspace clustering algorithms, which can identify clusters that originate from different subsets of features. Hunting for clusters in arbitrary subspaces is intractable due to the infinite search space spanned by all feature combinations. In this work, we present a subspace clustering algorithm that can be applied for exhaustively screening all feature combinations of small- or medium-sized datasets (approximately 30 features). Based on a robustness analysis via subsampling we are able to identify a set of stable candidate subspace cluster solutions.  相似文献   

3.
函数型数据的稀疏性和无穷维特性使得传统聚类分析失效。针对此问题,本文在界定函数型数据概念与内涵的基础上提出了一种自适应迭代更新聚类分析。首先,基于数据参数信息实现无穷维函数空间向有限维多元空间的过渡;在此基础上,依据变量信息含量的差异构建了自适应赋权聚类统计量,并依此为函数型数据的相似性测度进行初始类别划分;进一步地,在给定阈值限制下,对所有函数的初始类别归属进行自适应迭代更新,将收敛的优化结果作为最终的类别划分。随机模拟和实证检验表明,与现有的同类函数型聚类分析相比,文中方法的分类正确率显著提高,体现了新方法的相对优良性和实际问题应用中的有效性。  相似文献   

4.
This article focuses on the clustering problem based on Dirichlet process (DP) mixtures. To model both time invariant and temporal patterns, different from other existing clustering methods, the proposed semi-parametric model is flexible in that both the common and unique patterns are taken into account simultaneously. Furthermore, by jointly clustering subjects and the associated variables, the intrinsic complex shared patterns among subjects and among variables are expected to be captured. The number of clusters and cluster assignments are directly inferred with the use of DP. Simulation studies illustrate the effectiveness of the proposed method. An application to wheal size data is discussed with an aim of identifying novel temporal patterns among allergens within subject clusters.  相似文献   

5.
基于遗传算法的投影寻踪聚类   总被引:1,自引:0,他引:1  
传统的投影寻踪聚类算法PROCLUS是一种有效的处理高维数据聚类的算法,但此算法是利用爬山法(Hill climbing)对各类中心点进行循环迭代、选取最优的过程,由于爬山法是一种局部搜索(local search)方法,得到的最优解可能仅仅是局部最优。针对上述缺陷,提出一种改进的投影寻踪聚类算法,即利用遗传算法(Genetic Algorithm)对各类中心点进行循环迭代,寻找到全局最优解。仿真实验结果证明了新算法的可行性和有效性。  相似文献   

6.
数据分布密度划分的聚类算法是数据挖掘聚类算法的主要方法之一。针对传统密度划分聚类算法存在运算复杂、运行效率不高等缺陷,设计高维分步投影的多重分区聚类算法;以高维分布投影密度为依据,对数据集进行多重分区,产生数据集的子簇空间,并进行子簇合并,形成理想的聚类结果;依据该算法进行实验,结果证明该算法具有运算简单和运行效率高等优良性。  相似文献   

7.
This study develops a robust automatic algorithm for clustering probability density functions based on the previous research. Unlike other existing methods that often pre-determine the number of clusters, this method can self-organize data groups based on the original data structure. The proposed clustering method is also robust in regards to noise. Three examples of synthetic data and a real-world COREL dataset are utilized to illustrate the accurateness and effectiveness of the proposed approach.  相似文献   

8.
We consider n individuals described by p variables, represented by points of the surface of unit hypersphere. We suppose that the individuals are fixed and the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained in these algorithms, using measures of within-groups variability and between-groups variability and compare these clusters with those obtained in other clustering approaches, by analyzing simulated and real data.  相似文献   

9.
This paper suggests an evolving possibilistic approach for fuzzy modelling of time-varying processes. The approach is based on an extension of the well-known possibilistic fuzzy c-means (FCM) clustering and functional fuzzy rule-based modelling. Evolving possibilistic fuzzy modelling (ePFM) employs memberships and typicalities to recursively cluster data, and uses participatory learning to adapt the model structure as a stream data is input. The idea of possibilistic clustering plays a key role when the data are noisy and with outliers due to the relaxation of the restriction on membership degrees to add up unity in FCM clustering algorithm. To show the usefulness of ePFM, the approach is addressed for system identification using Box & Jenkins gas furnace data as well as time series forecasting considering the chaotic Mackey–Glass series and data produced by a synthetic time-varying process with parameter drift. The results show that ePFM is a potential candidate for nonlinear time-varying systems modelling, with comparable or better performance than alternative approaches, mainly when noise and outliers affect the data available.  相似文献   

10.
This paper addresses the problem of identifying groups that satisfy the specific conditions for the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising in the identification.  相似文献   

11.
A new method for constructing interpretable principal components is proposed. The method first clusters the variables, and then interpretable (sparse) components are constructed from the correlation matrices of the clustered variables. For the first step of the method, a new weighted-variances method for clustering variables is proposed. It reflects the nature of the problem that the interpretable components should maximize the explained variance and thus provide sparse dimension reduction. An important feature of the new clustering procedure is that the optimal number of clusters (and components) can be determined in a non-subjective manner. The new method is illustrated using well-known simulated and real data sets. It clearly outperforms many existing methods for sparse principal component analysis in terms of both explained variance and sparseness.  相似文献   

12.
Compared to tests for localized clusters, the tests for global clustering only collect evidence for clustering throughout the study region without evaluating the statistical significance of the individual clusters. The weighted likelihood ratio (WLR) test based on the weighted sum of likelihood ratios represents an important class of tests for global clustering. Song and Kulldorff (Likelihood based tests for spatial randomness. Stat Med. 2006;25(5):825–839) developed a wide variety of weight functions with the WLR test for global clustering. However, these weight functions are often defined based on the cell population size or the geographic information such as area size and distance between cells. They do not make use of the information from the observed count, although the likelihood ratio of a potential cluster depends on both the observed count and its population size. In this paper, we develop a self-adjusted weight function to directly allocate weights onto the likelihood ratios according to their values. The power of the test was evaluated and compared with existing methods based on a benchmark data set. The comparison results favour the suggested test especially under global chain clustering models.  相似文献   

13.
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient to solve binary clustering problems. When applied to multi-way clustering, either the binary spectral clustering is recursively applied or an embedding to spectral space is done and some other methods, such as K-means clustering, are used to cluster the points. Here we propose and study a K-way clustering algorithm – spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation, which has a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals a cluster structure clearly, then we apply linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide and conquer method. To get the overall clustering results, we apply the cluster assignment obtained in the previous step as the initialization of multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations.  相似文献   

14.
In document clustering, a document may be assigned to multiple clusters and the probabilities of a document belonging to different clusters are directly normalized. We propose a new Posterior Probabilistic Clustering (PPC) model that has this normalization property. The clustering model is based on Nonnegative Matrix Factorization (NMF) and flexible such that if we use class conditional probability normalization, the model reduces to Probabilistic Latent Semantic Indexing (PLSI). Systematic comparison and evaluation indicates that PPC is competitive with other state-of-art clustering methods. Furthermore, the results of PPC are more sparse and orthogonal, both of which are highly desirable.  相似文献   

15.
We consider Dirichlet process mixture models in which the observed clusters in any particular dataset are not viewed as belonging to a finite set of possible clusters but rather as representatives of a latent structure in which objects belong to one of a potentially infinite number of clusters. As more information is revealed the number of inferred clusters is allowed to grow. The precision parameter of the Dirichlet process is a crucial parameter that controls the number of clusters. We develop a framework for the specification of the hyperparameters associated with the prior for the precision parameter that can be used both in the presence or absence of subjective prior information about the level of clustering. Our approach is illustrated in an analysis of clustering brands at the magazine Which?. The results are compared with the approach of Dorazio (2009) via a simulation study.  相似文献   

16.
The K-means clustering method is a widely adopted clustering algorithm in data mining and pattern recognition, where the partitions are made by minimizing the total within group sum of squares based on a given set of variables. Weighted K-means clustering is an extension of the K-means method by assigning nonnegative weights to the set of variables. In this paper, we aim to obtain more meaningful and interpretable clusters by deriving the optimal variable weights for weighted K-means clustering. Specifically, we improve the weighted k-means clustering method by introducing a new algorithm to obtain the globally optimal variable weights based on the Karush-Kuhn-Tucker conditions. We present the mathematical formulation for the clustering problem, derive the structural properties of the optimal weights, and implement an recursive algorithm to calculate the optimal weights. Numerical examples on simulated and real data indicate that our method is superior in both clustering accuracy and computational efficiency.  相似文献   

17.
Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Comparing the clustering methods for the real data set confirmed the findings of the simulation. This study yields concrete suggestions to future researchers to determine the best method for clustering their functional data.  相似文献   

18.
Existing statistical methods for the detection of space–time clusters of point events are retrospective, in that they are used to ascertain whether space–time clustering exists among a fixed number of past events. In contrast, prospective methods treat a series of observations sequentially, with the aim of detecting quickly any changes that occur in the series. In this paper, cumulative sum methods of monitoring are adapted for use with Knox's space–time statistic. The result is a procedure for the rapid detection of any emergent space–time interactions for a set of sequentially monitored point events. The approach relies on a 'local' Knox statistic that is useful in retrospective analyses to detect when and where space–time interaction occurs. The distribution of the local Knox statistic under the null hypothesis of no space–time interaction is derived. The retrospective local statistic and the prospective cumulative sum monitoring method are illustrated by using previously published data on Burkitt's lymphoma in Uganda.  相似文献   

19.
One of the most popular methods and algorithms to partition data to k clusters is k-means clustering algorithm. Since this method relies on some basic conditions such as, the existence of mean and finite variance, it is unsuitable for data that their variances are infinite such as data with heavy tailed distribution. Pitman Measure of Closeness (PMC) is a criterion to show how much an estimator is close to its parameter with respect to another estimator. In this article using PMC, based on k-means clustering, a new distance and clustering algorithm is developed for heavy tailed data.  相似文献   

20.
A proper understanding and modelling of the behaviour of heavily loaded large-scale electrical transmission systems is essential for a secure and uninterrupted operation. In this paper, we present methods to cluster electrical power networks based on different criteria into regions. These regions are useful for the efficient modelling of large transcontinental electricity networks, switching operation decisions or placement of redundant parts of the monitoring and control system. In alternating current electricity networks, power oscillations are normal, but they can become dangerous if they build up. The first approach uses the correlation between results of a stability assessment for these oscillations at every node for the cluster criterion. The second method concentrates on the network topology and uses spectral clustering on the network graph to create clusters where all nodes are interconnected. In this work, we also discuss the problem how to choose the right number of clusters and how the discussed clustering methods can be used for an efficient modelling of large electricity networks or in protection and control systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号