期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Identification of target clusters by using the restricted normal mixture model

Seung-Gu Kim Yung-Seop Lee 《Journal of applied statistics》2013,40(5):941-960

This paper addresses the problem of identifying groups that satisfy the specific conditions for the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising in the identification. 相似文献

2.

Exploring multivariate data using directions of high density

FOSTER PETER 《Statistics and Computing》1998,8(4):347-355

The most common techniques for graphically presenting a multivariate dataset involve projection onto a one or two-dimensional subspace. Interpretation of such plots is not always straightforward because projections are smoothing operations in that structure can be obscured by projection but never enhanced. In this paper an alternative procedure for finding interesting features is proposed that is based on locating the modes of an induced hyperspherical density function, and a simple algorithm for this purpose is developed. Emphasis is placed on identifying the non-linear effects, such as clustering, so to this end the data are firstly sphered to remove all of the location, scale and correlational structure. A set of simulated bivariate data and artistic qualities of painters data are used as examples. 相似文献

3.

Dimension reduction for model-based clustering 总被引：1，自引：0，他引：1

Luca Scrucca 《Statistics and Computing》2010,20(4):471-484

We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The proposed method aims at reducing the dimensionality by identifying a set of linear combinations, ordered by importance as quantified by the associated eigenvalues, of the original features which capture most of the cluster structure contained in the data. Observations may then be projected onto such a reduced subspace, thus providing summary plots which help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The new constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets. 相似文献

4.

Clustering of Variables Based on Watson Distribution on Hypersphere: A Comparison of Algorithms

Adelaide Figueiredo Paulo Gomes 《统计学通讯:模拟与计算》2015,44(10):2622-2635

We consider n individuals described by p variables, represented by points of the surface of unit hypersphere. We suppose that the individuals are fixed and the set of variables comes from a mixture of bipolar Watson distributions. For the mixture identification, we use EM and dynamic clusters algorithms, which enable us to obtain a partition of the set of variables into clusters of variables.

Our aim is to evaluate the clusters obtained in these algorithms, using measures of within-groups variability and between-groups variability and compare these clusters with those obtained in other clustering approaches, by analyzing simulated and real data. 相似文献

5.

Strong Consistency of Reduced K‐means Clustering

Yoshikazu Terada 《Scandinavian Journal of Statistics》2014,41(4):913-931

Reduced k‐means clustering is a method for clustering objects in a low‐dimensional subspace. The advantage of this method is that both clustering of objects and low‐dimensional subspace reflecting the cluster structure are simultaneously obtained. In this paper, the relationship between conventional k‐means clustering and reduced k‐means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced k‐means clustering as unboundedly increasing sample size have been presented. The results for a more general model considering conventional k‐means clustering and reduced k‐means clustering are provided in this paper. Moreover, a consistent selection of the numbers of clusters and dimensions is described. 相似文献

6.

A sequential clustering algorithm with applications to gene expression data

Jongwoo Song Dan L. Nicolae 《Journal of the Korean Statistical Society》2009,38(2):175-184

Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most of the clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially, and allows for sporadic objects, so there are objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First it finds candidates for centers of clusters. Multiple candidates are used to make the search for clusters more efficient. Secondly, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data and we apply this method to analyze gene expression profiles in a study on the plasticity of the dendritic cells. 相似文献

7.

Clustering ranking data in market segmentation: a case study on the Italian McDonald's customers’ preferences†

Eugenio Brentari Livia Dancelli 《Journal of applied statistics》2016,43(11):1959-1976

Cluster analysis is often used for market segmentation. When the inputs in the clustering algorithm are ranking data, the intersubject (dis)similarities must be measured by matching-type measures, able to take account of the ordinal nature of the data. Among them, we used a Weighted Spearman's rho, suitably transformed into a (dis)similarity measure, in order to emphasize the concordance on the top ranks. This allows creating clusters grouping customers that place the same items (products, services, etc.) higher in their rankings. Also the statistical instruments used to interpret the clusters must be conceived to deal with ordinal data. The median and other location measures are appropriate but not always able to clearly differentiate groups. The so-called bipolar mean, with its related variability measure, may reveal some additional features. A case study on real data from a survey carried out in the Italian McDonald's restaurants is presented. 相似文献

8.

Clustering objects on subsets of attributes (with discussion)

Jerome H. Friedman Jacqueline J. Meulman 《Journal of the Royal Statistical Society. Series B, Statistical methodology》2004,66(4):815-849

Summary. A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms this procedure encourages those algorithms to detect automatically subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity for detecting especially low cardinality groups clustering on a small subset of variables are discussed. Applications in different domains, including gene expression arrays, are presented. 相似文献

9.

On the strengths of the self-updating process clustering algorithm

《Journal of Statistical Computation and Simulation》2012,82(5):1010-1031

The self-updating process (SUP) is a clustering algorithm that stands from the viewpoint of data points and simulates the process how data points move and perform self-clustering. It is an iterative process on the sample space and allows for both time-varying and time-invariant operators. By simulations and comparisons, this paper shows that SUP is particularly competitive in clustering (i) data with noise, (ii) data with a large number of clusters, and (iii) unbalanced data. When noise is present in the data, SUP is able to isolate the noise data points while performing clustering simultaneously. The property of the local updating enables SUP to handle data with a large number of clusters and data of various structures. In this paper, we showed that the blurring mean-shift is a static SUP. Therefore, our discussions on the strengths of SUP also apply to the blurring mean-shift. 相似文献

10.

Using clustering of rankings to explain brand preferences with personality and socio-demographic variables

Daniel Müllensiefen Hedie Howells 《Journal of applied statistics》2018,45(6):1009-1029

The primary aim of market segmentation is to identify relevant groups of consumers that can be addressed efficiently by marketing or advertising campaigns. This paper addresses the issue whether consumer groups can be identified from background variables that are not brand-related, and how much personality vs. socio-demographic variables contribute to the identification of consumer clusters. This is done by clustering aggregated preferences for 25 brands across 5 different product categories, and by relating socio-demographic and personality variables to the clusters using logistic regression and random forests over a range of different numbers of clusters. Results indicate that some personality variables contribute significantly to the identification of consumer groups in one sample. However, these results were not replicated on a second sample that was more heterogeneous in terms of socio-demographic characteristics and not representative of the brands target audience. 相似文献

11.

基于改进的自适应传播模型的农业风险区划分析

谢远涛杨娟刘皓宇《统计与信息论坛》2017,(1):33-40

农业险定价中的核心问题是农业风险区划问题,为了体现农业区划中个体指标的动态发展特征,根据近邻传播改进自适应近邻传播聚类方法对数据进行优化,基于轮廓系数、归属度和吸引度得到最佳聚类中心和几何聚类中心,并将聚类转化为新数据集的聚类问题;选取代表性的棉花为例进行实证分析,通过计算生产、销售、收入、财政等指标进行棉花风险区划实例分析,计算最优棉花风险区划,结果表明对于具有动态特征的数据,本模型具有很好的有效性、实用性和解释性。相似文献

12.

A clustering approach to interpretable principal components

Doyo G. Enki Nickolay T. Trendafilov Ian T. Jolliffe 《Journal of applied statistics》2013,40(3):583-599

A new method for constructing interpretable principal components is proposed. The method first clusters the variables, and then interpretable (sparse) components are constructed from the correlation matrices of the clustered variables. For the first step of the method, a new weighted-variances method for clustering variables is proposed. It reflects the nature of the problem that the interpretable components should maximize the explained variance and thus provide sparse dimension reduction. An important feature of the new clustering procedure is that the optimal number of clusters (and components) can be determined in a non-subjective manner. The new method is illustrated using well-known simulated and real data sets. It clearly outperforms many existing methods for sparse principal component analysis in terms of both explained variance and sparseness. 相似文献

13.

面板数据加权聚类分析方法研究

张立军彭浩《统计与信息论坛》2017,(4):21-26

在面板数据聚类分析方法的研究中,基于面板数据兼具截面维度和时间维度的特征,对欧氏距离函数进行了改进,在聚类过程中考虑指标权重与时间权重,提出了适用于面板数据聚类分析的"加权距离函数"以及相应的Ward.D聚类方法。首先定义了考虑指标绝对值、邻近时点增长率以及波动变异程度的欧氏距离函数;然后,将指标权重与时间权重通过线性模型集结成综合加权距离,最终实现面板数据的加权聚类过程。实证分析结果显示,考虑指标权重与时间权重的面板数据加权聚类分析方法具有更好的分辨能力,能提高样本聚类的准确性。相似文献

14.

A self-adjusted weighted likelihood ratio test for global clustering of disease

《Journal of Statistical Computation and Simulation》2012,82(5):996-1009

Compared to tests for localized clusters, the tests for global clustering only collect evidence for clustering throughout the study region without evaluating the statistical significance of the individual clusters. The weighted likelihood ratio (WLR) test based on the weighted sum of likelihood ratios represents an important class of tests for global clustering. Song and Kulldorff (Likelihood based tests for spatial randomness. Stat Med. 2006;25(5):825–839) developed a wide variety of weight functions with the WLR test for global clustering. However, these weight functions are often defined based on the cell population size or the geographic information such as area size and distance between cells. They do not make use of the information from the observed count, although the likelihood ratio of a potential cluster depends on both the observed count and its population size. In this paper, we develop a self-adjusted weight function to directly allocate weights onto the likelihood ratios according to their values. The power of the test was evaluated and compared with existing methods based on a benchmark data set. The comparison results favour the suggested test especially under global chain clustering models. 相似文献

15.

Modular-transform based clustering

Gang Wang Jun Wang Mingyu Wang 《Journal of applied statistics》2013,40(12):2749-2759

Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient to solve binary clustering problems. When applied to multi-way clustering, either the binary spectral clustering is recursively applied or an embedding to spectral space is done and some other methods, such as K-means clustering, are used to cluster the points. Here we propose and study a K-way clustering algorithm – spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation, which has a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals a cluster structure clearly, then we apply linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide and conquer method. To get the overall clustering results, we apply the cluster assignment obtained in the previous step as the initialization of multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations. 相似文献

16.

Community detection with structural and attribute similarities

Fengqin Tang Wenwen Ding 《Journal of Statistical Computation and Simulation》2019,89(4):668-685

An important problem in network analysis is to identify significant communities. Most of the real-world data sets exhibit a certain topological structure between nodes and the attributes describing them. In this paper, we propose a new community detection criterion considering both structural similarities and attribute similarities. The clustering method integrates the cost of clustering node attributes with the cost of clustering the structural information via the normalized modularity. We show that the joint clustering problem can be formulated as a spectral relaxation problem. The proposed algorithm is capable of learning the degree of contributions of individual node attributes. A number of numerical studies involving simulated and real data sets demonstrate the effectiveness of the proposed method. 相似文献

17.

An iterative algorithm for optimal variable weighting in K-means clustering

Shaonan Zhang Jiaqiao Hu Haipeng Xing Wei Zhu 《统计学通讯:模拟与计算》2019,48(5):1346-1365

The K-means clustering method is a widely adopted clustering algorithm in data mining and pattern recognition, where the partitions are made by minimizing the total within group sum of squares based on a given set of variables. Weighted K-means clustering is an extension of the K-means method by assigning nonnegative weights to the set of variables. In this paper, we aim to obtain more meaningful and interpretable clusters by deriving the optimal variable weights for weighted K-means clustering. Specifically, we improve the weighted k-means clustering method by introducing a new algorithm to obtain the globally optimal variable weights based on the Karush-Kuhn-Tucker conditions. We present the mathematical formulation for the clustering problem, derive the structural properties of the optimal weights, and implement an recursive algorithm to calculate the optimal weights. Numerical examples on simulated and real data indicate that our method is superior in both clustering accuracy and computational efficiency. 相似文献

18.

Type i error rates for divisive clustering methods for grouping means in analysis of variance

S. G. Carmer W. T. Lin 《统计学通讯:模拟与计算》2013,42(4):451-466

Five univariate divisive clustering methods for grouping means in analysis of variance are considered.Unlike pairwise multiple comparison procedures, cluster analysis has the advantage of producing non-overlapping groups of the treatment means. Comparisonwise Type I error rates and average numbers of clusters per experiment are examined for a heterogeneous set of 20 true treatment means with 11 embedded homogenous sub-groups of one or more treatments. The results of a simulation study clearly show that observed comparisonwise error rate and number of clusters are determined to a far greater extent by the precision of the experiment (as determined by the magnitude of the standard deviation) than by either the stated significance level or the clustering method used. 相似文献

19.

Profiles identification on hierarchical tree structure data sets

Conceição Rocha Pedro Quelhas Brito 《Journal of applied statistics》2018,45(15):2848-2863

In this work we study a way to explore and extract more information from data sets with a hierarchical tree structure. We propose that any statistical study on this type of data should be made by group, after clustering. In this sense, the most adequate approach is to use the Mahalanobis–Wasserstein distance as a measure of similarity between the cases, to carry out clustering or unsupervised classification. This methodology allows for the clustering of cases, as well as the identification of their profiles, based on the distribution of all the variables that characterises each subject associated with each case. An application to a set of teenagers' interviews regarding their habits of communication is described. The interviewees answered several questions about the kind of contacts they had on their phone, Facebook, email or messenger as well as the frequency of communication between them. The results indicate that the methodology is adequate to cluster this kind of data sets, since it allows us to identify and characterise different profiles from the data. We compare the results obtained with this methodology with the ones obtained using the entire database, and we conclude that they may lead to different findings. 相似文献

20.

基于自适应迭代更新的函数型数据聚类方法研究

王德青等《统计研究》2015,32(4):91-96

函数型数据的稀疏性和无穷维特性使得传统聚类分析失效。针对此问题,本文在界定函数型数据概念与内涵的基础上提出了一种自适应迭代更新聚类分析。首先,基于数据参数信息实现无穷维函数空间向有限维多元空间的过渡;在此基础上,依据变量信息含量的差异构建了自适应赋权聚类统计量,并依此为函数型数据的相似性测度进行初始类别划分;进一步地,在给定阈值限制下,对所有函数的初始类别归属进行自适应迭代更新,将收敛的优化结果作为最终的类别划分。随机模拟和实证检验表明,与现有的同类函数型聚类分析相比,文中方法的分类正确率显著提高,体现了新方法的相对优良性和实际问题应用中的有效性。相似文献