
1.

Clustering algorithm for proximity-relation matrix and its applications

Wen-Liang Hung De-Hua Chen 《Journal of applied statistics》2013,40(9):1875-1892

In this paper, we present a new algorithm for clustering a proximity-relation matrix that does not require the transitivity property. The proposed algorithm is first inspired by the idea of Yang and Wu [16] and then turned into a self-organizing process built upon the intuition behind clustering. At the end of the process, subjects belonging to the same cluster should converge to the same point, which represents the cluster center. However, the performance of Yang and Wu's algorithm depends on parameter selection. In this paper, we use the partition entropy (PE) index to choose the parameter. Numerical results illustrate that the proposed method not only solves the parameter selection problem but also obtains an optimal clustering result. Finally, we apply the proposed algorithm to three applications: evaluating the performance of higher education in Taiwan, machine–parts grouping in cellular manufacturing systems, and clustering probability density functions.
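The abstract above does not define the partition entropy (PE) index. As a minimal sketch, assuming it refers to Bezdek's standard partition entropy of a fuzzy membership matrix, the index could be computed as follows. The function name and the example matrices are illustrative assumptions, not taken from the paper.

```python
import math

def partition_entropy(memberships):
    """Partition entropy of a fuzzy partition (assumed: Bezdek's definition).

    memberships[i][k] is the degree to which object i belongs to cluster k;
    each row is assumed to sum to 1. Lower values indicate a crisper partition.
    """
    n = len(memberships)
    total = 0.0
    for row in memberships:
        for u in row:
            if u > 0.0:  # treat 0 * log(0) as 0
                total -= u * math.log(u)
    return total / n

# Hypothetical membership matrices for two objects and two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0]]   # every object fully in one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # maximally ambiguous memberships

print(partition_entropy(crisp))
print(partition_entropy(fuzzy))
```

Under this definition a crisp partition has entropy 0 and a maximally ambiguous two-cluster partition has entropy log 2, so one plausible use is to prefer the parameter value whose resulting partition has the lowest PE.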

2.

Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

Dongmeng Liu Jinko Graham 《The American statistician》2019,73(1):70-79

We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.
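For context, the classic individual silhouette width that one of the measures above extends can be computed directly from pairwise dissimilarities and a hard partition. The sketch below implements only the classic silhouette, not the authors' probability-like measures; the function name and the toy data are assumptions made for illustration.

```python
def silhouette_widths(dissim, labels):
    """Classic silhouette width s(i) = (b - a) / max(a, b) for each object.

    dissim is a symmetric matrix of pairwise dissimilarities and labels is a
    hard partition with at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the object's cluster
        a = sum(dissim[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dissim[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Hypothetical example: four points on a line forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dissim = [[abs(p - q) for q in points] for p in points]
widths = silhouette_widths(dissim, [0, 0, 1, 1])
print(widths)
```

Silhouette widths lie in [-1, 1] rather than [0, 1], which is one reason they do not behave like membership probabilities.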

3.

Multi-partition clustering algorithm based on high-dimensional stepwise projection

张维群 陈文浩 《统计与信息论坛》2017,(2):18-22

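Clustering based on partitioning by data distribution density is one of the main approaches among data-mining clustering algorithms. To address the shortcomings of traditional density-partition clustering algorithms, such as computational complexity and low running efficiency, a multi-partition clustering algorithm based on high-dimensional stepwise projection is designed. Using the high-dimensional distribution projection density as the criterion, the data set is partitioned multiple times to produce a space of sub-clusters, which are then merged to form the desired clustering result. Experiments with the algorithm show that it is computationally simple and runs efficiently.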

4.

Self-updating clustering algorithm for estimating the parameters in mixtures of von Mises distributions

Wen-Liang Hung Shou-Jen Chang-Chien Miin-Shen Yang 《Journal of applied statistics》2012,39(10):2259-2274

The EM algorithm is the standard method for estimating the parameters in finite mixture models. Yang and Pan [25] proposed a generalized classification maximum likelihood procedure, called the fuzzy c-directions (FCD) clustering algorithm, for estimating the parameters in mixtures of von Mises distributions. Two main drawbacks of the EM algorithm are its slow convergence and the dependence of the solution on the initial value used. The choice of initial values is of great importance in the algorithm-based literature as it can heavily influence the speed of convergence of the algorithm and its ability to locate the global maximum. Moreover, the algorithmic frameworks of EM and FCD are closely related, so FCD shares the drawbacks of the EM algorithm. To resolve these problems, this paper proposes another clustering algorithm, which can self-organize local optimal cluster numbers without using cluster validity functions. Numerical results clearly indicate that the proposed algorithm is superior in performance to the EM and FCD algorithms. Finally, we apply the proposed algorithm to two real data sets.

5.

On the regularized Laplacian eigenmaps

Ying Cao Di-Rong Chen 《Journal of statistical planning and inference》2012

To find an appropriate low-dimensional representation for complex data is one of the central problems in machine learning and data analysis. In this paper, a nonlinear dimensionality reduction algorithm called regularized Laplacian eigenmaps (RLEM) is proposed, motivated by the method for regularized spectral clustering. This algorithm provides a natural out-of-sample extension for dealing with points not in the original data set. The consistency of the RLEM algorithm is investigated. Moreover, a convergence rate is established depending on the approximation property and the capacity of the reproducing kernel Hilbert space measured by covering numbers. Experiments are given to illustrate our algorithm.

6.

Density-based clustering analysis of panel data

杨娟 谢远涛 《统计与信息论坛》2014,(2):23-28

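In studying the clustering of panel data, a logistic regression model is used for similarity measurement to construct similarity coefficients and an asymmetric similarity matrix. Existing clustering algorithms apply only to symmetric similarity matrices. For clustering with an asymmetric similarity matrix, best-first search and the silhouette coefficient are used to improve the DBSCAN clustering method, yielding the proposed BF-DBSCAN method. A case study compares the clustering results of BF-DBSCAN and DBSCAN, examines the effect of different parameter settings on the BF-DBSCAN results, and verifies the effectiveness and practicality of the method.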

7.

Bayesian nonparametric clustering for large data sets

Zuanetti Daiane Aparecida Müller Peter Zhu Yitan Yang Shengjie Ji Yuan 《Statistics and Computing》2019,29(2):203-215

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”


8.

An automatic clustering algorithm for probability density functions

《Journal of Statistical Computation and Simulation》2012,82(15):3047-3063

We propose an intuitive and computationally simple algorithm for clustering probability density functions (pdfs). A data-driven learning mechanism is incorporated in the algorithm in order to determine suitable widths for the clusters. The clustering results show that the proposed algorithm is able to automatically group the pdfs and provide the optimal cluster number without any a priori information. The performance study also shows that the proposed algorithm is more efficient than existing ones. In addition, clustering can serve as an intermediate compression tool in content-based multimedia retrieval; we apply the proposed algorithm to categorize a subset of the COREL image database, and the clustering results indicate that it performs well in colour image categorization.

9.

Clustering of nonlinear time series based on kernel density estimation

张贝贝《统计教育》2010,(4):15-20

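This paper studies the clustering of time series. Because most real-world time series are nonlinear, whereas existing time-series clustering methods mostly rely on linear time-series models, this paper proposes a clustering method applicable to nonlinear time series. The similarity between the two-dimensional kernel density estimates of the time series is used as the distance measure for nonlinear time series. This distance measure is nonparametric, takes into account differences in the autocorrelation structure of the series, and can roughly identify similarities in the shape and dynamic dependence structure of time series. Consistent with the theoretical results, our simulation experiments also confirm the effectiveness of this distance measure.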

10.

A sequential clustering algorithm with applications to gene expression data

Jongwoo Song Dan L. Nicolae 《Journal of the Korean Statistical Society》2009,38(2):175-184

Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, that is, objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for centers of clusters. Multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data and we apply this method to analyze gene expression profiles in a study on the plasticity of dendritic cells.

11.

Peter Bühlmann Philipp Rütimann Sara van de Geer Cun-Hui Zhang 《Journal of statistical planning and inference》2013

We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster-representatives with the Lasso as the subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

12.

Comparison of Times Series with Unequal Length in the Frequency Domain

Jorge Caiado Nuno Crato Daniel Peña 《Communications in Statistics - Simulation and Computation》2013,42(3):527-540

In statistical data analysis it is often important to compare, classify, and cluster different time series. For these purposes various methods have been proposed in the literature, but they usually assume time series with the same sample size. In this article, we propose a spectral domain method for handling time series of unequal length. The method makes the spectral estimates comparable by producing statistics at the same frequency. The procedure is compared with other methods proposed in the literature by a Monte Carlo simulation study. As an illustrative example, the proposed spectral method is applied to cluster industrial production series of some developed countries.
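The abstract does not say how the spectral statistics are brought to the same frequencies. One simple possibility, shown here only as a sketch and not as the authors' procedure, is to evaluate each series' periodogram directly on a shared frequency grid, so that series of different lengths yield vectors of equal length. The function name, the grid, and the test series are all illustrative assumptions.

```python
import cmath
import math

def periodogram_at(series, freqs):
    """Periodogram of a mean-centered series evaluated at arbitrary
    angular frequencies (radians per time step)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    values = []
    for w in freqs:
        # Direct evaluation of the discrete Fourier sum at frequency w.
        s = sum(x * cmath.exp(-1j * w * t) for t, x in enumerate(centered))
        values.append(abs(s) ** 2 / n)
    return values

# Shared grid of 8 frequencies strictly inside (0, pi); an assumed choice.
freqs = [math.pi * k / 9 for k in range(1, 9)]

# Hypothetical series of unequal length with the same cycle (period 6).
short = [math.cos(math.pi / 3 * t) for t in range(60)]
long_ = [math.cos(math.pi / 3 * t) for t in range(90)]

p_short = periodogram_at(short, freqs)
p_long = periodogram_at(long_, freqs)
print(p_short.index(max(p_short)), p_long.index(max(p_long)))
```

Both periodograms have one value per grid frequency and peak at the same grid point, so they can be compared entry by entry. Any distance built on them would still need a normalization choice, since raw periodogram values depend on the series length.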

13.

A Clustering Method for Categorical Ordinal Data

Marco Giordan Giancarlo Diana 《Communications in Statistics - Theory and Methods》2013,42(7):1315-1334

Often, categorical ordinal data are clustered using a well-defined similarity measure for this kind of data and then using a clustering algorithm not specifically developed for them. The aim of this article is to introduce a new clustering method designed specifically for ordinal data. Objects are grouped using a multinomial model, a cluster tree and a pruning strategy. Two types of pruning are analyzed through simulations. The proposed method overcomes two typical problems of cluster analysis: the choice of the number of groups and the scale invariance.

14.

An unsupervised, ensemble clustering algorithm: A new approach for classification of X-ray sources

S.M. Hojnacki G. Micela S.M. LaLonde E.D. Feigelson J.H. Kastner 《Statistical Methodology》2008,5(4):350-360

A large volume of CCD X-ray spectra is being generated by the Chandra X-ray Observatory (Chandra) and XMM-Newton. Automated spectral analysis and classification methods can aid in sorting, characterizing, and classifying this large volume of CCD X-ray spectra in a non-parametric fashion, complementary to current parametric model fits. We have developed an algorithm that uses multivariate statistical techniques, including an ensemble clustering method, applied for the first time for X-ray spectral classification. The algorithm uses spectral data to group similar discrete sources of X-ray emission by placing the X-ray sources in a three-dimensional spectral sequence and then grouping the ordered sources into clusters based on their spectra. This new method can handle large quantities of data and operate independently of the requirement of spectral source models and a priori knowledge concerning the nature of the sources (i.e., young stars, interacting binaries, active galactic nuclei). We apply the method to Chandra imaging spectroscopy of the young stellar clusters in the Orion Nebula Cluster and the NGC 1333 star formation region.

15.

Evolving possibilistic fuzzy modelling

Leandro Maciel Rosangela Ballini Fernando Gomide 《Journal of Statistical Computation and Simulation》2017,87(7):1446-1466

This paper suggests an evolving possibilistic approach for fuzzy modelling of time-varying processes. The approach is based on an extension of the well-known possibilistic fuzzy c-means (FCM) clustering and functional fuzzy rule-based modelling. Evolving possibilistic fuzzy modelling (ePFM) employs memberships and typicalities to recursively cluster data, and uses participatory learning to adapt the model structure as stream data are input. The idea of possibilistic clustering plays a key role when the data are noisy and contain outliers, because it relaxes the restriction in the FCM clustering algorithm that membership degrees add up to unity. To show the usefulness of ePFM, the approach is applied to system identification using the Box & Jenkins gas furnace data as well as to time series forecasting considering the chaotic Mackey–Glass series and data produced by a synthetic time-varying process with parameter drift. The results show that ePFM is a potential candidate for nonlinear time-varying systems modelling, with comparable or better performance than alternative approaches, mainly when noise and outliers affect the data available.

16.

Research on clustering methods for nonlinear panel data

孙艳 黄咏宁 《统计与信息论坛》2017,(2):32-36

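For panel data whose variables are nonlinearly related, existing panel-data clustering methods based on linear algorithms cannot accurately measure the similarity between samples, and the interpretability of their clustering results is low. Taking into account both the nonlinear dependence among variables and the interpretability of the clustering results, a clustering method for nonlinear panel data is proposed: sample similarity is measured with a nonlinear kernel principal component algorithm, and probabilistic clustering of the samples is performed with a Gaussian mixture model. Empirical analysis shows that the method is effective and improves the interpretability of the clustering results.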

17.

Community detection with structural and attribute similarities

Fengqin Tang Wenwen Ding 《Journal of Statistical Computation and Simulation》2019,89(4):668-685

An important problem in network analysis is to identify significant communities. Most of the real-world data sets exhibit a certain topological structure between nodes and the attributes describing them. In this paper, we propose a new community detection criterion considering both structural similarities and attribute similarities. The clustering method integrates the cost of clustering node attributes with the cost of clustering the structural information via the normalized modularity. We show that the joint clustering problem can be formulated as a spectral relaxation problem. The proposed algorithm is capable of learning the degree of contributions of individual node attributes. A number of numerical studies involving simulated and real data sets demonstrate the effectiveness of the proposed method.

18.

A new algorithm for clustering based on kernel density estimation

L. C. Matioli S.R. Santos M. Kleina E. A. Leite 《Journal of applied statistics》2018,45(2):347-366

In this paper, we present an algorithm for clustering based on univariate kernel density estimation, named ClusterKDE. It consists of an iterative procedure in which a new cluster is obtained at each step by minimizing a smooth kernel function. Although in our applications we have used the univariate Gaussian kernel, any smooth kernel function can be used. The proposed algorithm has the advantage of not requiring the number of clusters a priori. Furthermore, the ClusterKDE algorithm is very simple, easy to implement, well-defined and stops in a finite number of steps; that is, it always converges independently of the initial point. We also illustrate our findings with numerical experiments in which the algorithm is implemented in Matlab and applied to practical problems. The results indicate that the ClusterKDE algorithm is competitive and fast when compared with the well-known Clusterdata and K-means algorithms, which are used in Matlab to cluster data.

19.

Clustering analysis of risky merchants based on Gaussian spectral clustering

黄丹阳 et al. 《统计研究》2021,38(6):145-160

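With the spread of electronic payment, more and more third-party payment platforms have emerged in the market, yet research on merchant risk on such platforms remains relatively scarce. This paper therefore proposes a risky-merchant clustering method based on Gaussian spectral clustering. First, a Gaussian mixture model is used to construct a two-mode network of transactions and transaction groups. Second, drawing on the idea of information propagation in networks, a two-mode network of merchants and transaction groups is constructed. Third, the spectral clustering method for two-mode networks is used to cluster both types of nodes in the network simultaneously: the clustering of merchant nodes distinguishes merchants of different risk levels, and the clustering of transaction-group nodes further characterizes the transaction behavior of risky merchants. Finally, the effectiveness of the model is verified on simulated data and on real data from a third-party payment platform. The experimental results show that the proposed method not only accurately distinguishes merchant groups of different risk levels but also summarizes the transaction characteristics of risky merchants, providing a reference for their supervision.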

20.

Weighted Support Vector Machine Using k-Means Clustering

Sungwan Bang 《Communications in Statistics - Simulation and Computation》2013,42(10):2307-2324

The support vector machine (SVM) has been successfully applied to various classification areas with great flexibility and a high level of classification accuracy. However, the SVM is not suitable for the classification of large or imbalanced datasets because of significant computational problems and a classification bias toward the dominant class. The SVM combined with the k-means clustering (KM-SVM) is a fast algorithm developed to accelerate both the training and the prediction of SVM classifiers by using the cluster centers obtained from the k-means clustering. In the KM-SVM algorithm, however, the penalty of misclassification is treated equally for each cluster center even though the contributions of different cluster centers to the classification can be different. In order to improve classification accuracy, we propose the WKM–SVM algorithm which imposes different penalties for the misclassification of cluster centers by using the number of data points within each cluster as a weight. As an extension of the WKM–SVM, the recovery process based on WKM–SVM is suggested to incorporate the information near the optimal boundary. Furthermore, the proposed WKM–SVM can be successfully applied to imbalanced datasets with an appropriate weighting strategy. Experiments show the effectiveness of our proposed methods.