
1.

Yang Guijun, Chen Weixiao 《统计与信息论坛》 (Statistics and Information Forum) 2012,(1):107-112

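Combining correlation analysis with directed cluster analysis, this paper proposes a directed correlation clustering method. Variables are first merged according to their correlations and then clustered directionally, which makes the results more reasonable and the clustering process simpler. Applied to survey data on the factors that influence the healthy development of university students, the method yields more reasonable results.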

2.

Evaluation of an analysis approach used to account for extra-variation in clustered categorical responses

Michael E. Miller, J. Richard Landis 《Communications in Statistics - Theory and Methods》2013,42(8):2645-2661

This article presents the results of a simulation study investigating the performance of an approach developed by Miller and Landis (1991) for the analysis of clustered categorical responses. Evaluation of this “two-step” approach, which utilizes the method of moments to estimate the extra-variation parameters and subsequently incorporates these parameters into estimating equations for modelling the marginal expectations, is carried out in an experimental setting involving a comparison between two groups of observations. We assume that data for both groups are collected from each cluster and responses are measured on a three-point ordinal scale. The performance of the estimators used in both “steps” of the analysis is investigated and comparisons are made to an alternative analysis method that ignores the clustering. The results indicate that in the chosen setting the test for a difference between groups generally operates at the nominal α=0.05 for 10 or more clusters and has increasing power with both an increasing number of clusters and an increasing treatment effect. These results provide a striking contrast to those obtained from an improper analysis that ignores clustering.

3.

A sequential clustering algorithm with applications to gene expression data

Jongwoo Song, Dan L. Nicolae 《Journal of the Korean Statistical Society》2009,38(2):175-184

Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, that is, objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for centers of clusters. Multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data and apply the method to gene expression profiles in a study on the plasticity of dendritic cells.

4.

Modular-transform based clustering

Gang Wang, Jun Wang, Mingyu Wang 《Journal of applied statistics》2013,40(12):2749-2759

Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is convenient for solving binary clustering problems. When applied to multi-way clustering, either binary spectral clustering is applied recursively, or the points are embedded in spectral space and clustered with another method such as K-means. Here we propose and study a K-way clustering algorithm, spectral modular transformation, based on the fact that the graph Laplacian has an equivalent representation with a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals the cluster structure clearly; we then apply a linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using the divide and conquer method. To get the overall clustering results, we use the cluster assignment obtained in the previous step as the initialization of the multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations.
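The binary case mentioned at the start of this abstract can be illustrated with a minimal sketch. It assumes the unnormalized Laplacian L = D - W and a split by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue), found here by a simple shifted power iteration. It is not the spectral modular transformation proposed in the paper, and the function name and toy matrix are hypothetical.

```python
def fiedler_bipartition(similarity, iterations=500):
    """Two-way spectral split of a symmetric similarity matrix (pure-Python sketch)."""
    n = len(similarity)
    degrees = [sum(row) for row in similarity]
    # Unnormalized graph Laplacian L = D - W.
    laplacian = [[(degrees[i] if i == j else 0.0) - similarity[i][j]
                  for j in range(n)] for i in range(n)]
    # By Gershgorin's theorem every eigenvalue of L is at most 2 * max degree,
    # so shift * I - L reverses the eigenvalue order and stays positive semidefinite.
    shift = 2.0 * max(degrees)
    vector = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        # Remove the constant component (eigenvector of L for eigenvalue 0).
        mean = sum(vector) / n
        vector = [v - mean for v in vector]
        # One power-iteration step with shift * I - L, then normalize.
        vector = [shift * vector[i]
                  - sum(laplacian[i][j] * vector[j] for j in range(n))
                  for i in range(n)]
        norm = sum(v * v for v in vector) ** 0.5
        vector = [v / norm for v in vector]
    # The sign pattern of the approximate Fiedler vector defines the two groups.
    return [1 if v > 0 else 0 for v in vector]


# Toy data: points 0-2 and points 3-5 form two tight groups joined by one weak link.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
labels = fiedler_bipartition(W)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector is only defined up to sign; only the grouping itself is meaningful.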

5.

On clustering shape data

《Journal of Statistical Computation and Simulation》2012,82(15):2995-3008


Among the statistical methods to model stochastic behaviours of objects, clustering is a preliminary technique to recognize similar patterns within a group of observations in a data set. Various distances to measure differences among objects could be invoked to cluster data through numerous clustering methods. When the variables at hand contain geometrical information about objects, such metrics should be adequately adapted. In fact, statistical methods for these typical data are endowed with a geometrical paradigm in a multivariate sense. In this paper, a procedure for clustering shape data is suggested employing appropriate metrics. Then, the best shape distance candidate as well as a suitable agglomerative method for clustering the simulated shape data are provided by considering cluster validation measures. The results are applied in a real-life application.

6.

Model-based clustering of multivariate ordinal data relying on a stochastic binary search algorithm

Christophe Biernacki, Julien Jacques 《Statistics and Computing》2016,26(5):929-943

We design a probability distribution for ordinal data by modeling the process generating data, which is assumed to rely only on order comparisons between categories. In contrast, most competitors either ignore the order information or add non-existent distance information. The data generating process is assumed, from optimality arguments, to be a stochastic binary search algorithm in a sorted table. The resulting distribution is natively governed by two meaningful parameters (position and precision) and has very appealing properties: decrease around the mode, shape tuning from uniformity to a Dirac, identifiability. Moreover, it is easily estimated by an EM algorithm since the path in the stochastic binary search algorithm can be considered as missing values. Then, using the classical latent class assumption, the previous univariate ordinal model is straightforwardly extended to model-based clustering for multivariate ordinal data. Parameters of this mixture model are estimated by an AECM algorithm. Both simulated and real data sets illustrate the great potential of this model by its ability to parsimoniously identify particularly relevant clusters which were unsuspected by some traditional competitors.

7.

Cluster Data Streams with Noisy Variables

Hu Yang, Chenqun Yu 《Communications in Statistics - Simulation and Computation》2016,45(4):1381-1396

Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer from loss of cluster purity and become unstable when the input data streams have too many noisy variables. In this article, we propose a clustering algorithm to cluster data streams with noisy variables. Simulation results show that our proposed method outperforms previous approaches by adding a variable selection process as a component of the clustering algorithm. The results of two experiments indicate that clustering data streams with variable selection is more stable and yields better purity than clustering without it. Another experiment on the KDD-CUP99 dataset also shows that our algorithm generates more stable results.

8.

General location multivariate latent variable models for mixed correlated bounded continuous, ordinal, and nominal responses with non-ignorable missing data

Elham Tabrizi, Ehsan Bahrami Samani, Mojtaba Ganjali 《Journal of applied statistics》2021,48(5):765

Using a multivariate latent variable approach, this article proposes some new general models to analyze correlated bounded continuous and categorical (nominal and/or ordinal) responses with and without non-ignorable missing values. First, we discuss regression methods for jointly analyzing continuous, nominal, and ordinal responses, motivated by the analysis of data from studies of toxicity development. Second, using the beta and Dirichlet distributions, we extend the models so that some continuous responses are replaced by bounded continuous responses. The joint distribution of the bounded continuous, nominal and ordinal variables is decomposed into a marginal multinomial distribution for the nominal variable and a conditional multivariate joint distribution for the bounded continuous and ordinal variables given the nominal variable. We estimate the regression parameters under the new general location models using the maximum-likelihood method. Sensitivity analysis is also performed to study the influence of small perturbations of the parameters of the missing mechanisms of the model on the maximal normal curvature. The proposed models are applied to two data sets: BMI, Steatosis and Osteoporosis data and Tehran household expenditure budgets.

9.

Cluster detection and clustering with random start forward searches

Anthony C. Atkinson, Marco Riani, Andrea Cerioli 《Journal of applied statistics》2018,45(5):777-798

The forward search is a method of robust data analysis in which outlier-free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Application of the forward search to the proposed individual clusters leads to the establishment of cluster membership through the identification of non-cluster members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with eight clusters our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.

10.

A Comparison of Hierarchical Methods for Clustering Functional Data

Laura Ferreira 《Communications in Statistics - Simulation and Computation》2013,42(9):1925-1949

Functional data analysis (FDA), the analysis of data that can be considered a set of observed continuous functions, is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, non-periodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, clustering methods were also compared when the number of clusters has been misspecified. To illustrate the results, a real set of functional data was clustered where the true clustering structure is believed to be known. Comparing the clustering methods for the real data set confirmed the findings of the simulation. This study yields concrete suggestions to future researchers to determine the best method for clustering their functional data.
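The Rand index used as the comparison measure in this abstract can be sketched in a few lines. This is an illustration of the standard unadjusted definition (the share of object pairs on which two partitions agree), not the study's own code; the function name and example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by two partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    if len(labels_a) < 2:
        raise ValueError("at least two objects are needed to form a pair")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        # A pair agrees when it is joined in both partitions or split in both.
        agreements += together_a == together_b
        pairs += 1
    return agreements / pairs


true_clusters = [0, 0, 1, 1]
found_clusters = [0, 0, 1, 2]  # the second true cluster was split in two
score = rand_index(true_clusters, found_clusters)  # 5 of the 6 pairs agree
```

The index depends only on which objects are grouped together, so relabeling the clusters of either partition leaves it unchanged.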

11.

A spatial scan statistic for survival data based on generalized life distribution

Vijaya Bhatt, Neeraj Tiwari 《Communications in Statistics - Theory and Methods》2013,42(19):5730-5744


For many years, detection of clusters has been of great public health interest and widely studied. Several methods have been developed to detect clusters and their performance has been evaluated in various contexts. Spatial scan statistics are widely used for geographical cluster detection and inference. Different types of discrete or continuous data can be analyzed using spatial scan statistics for Bernoulli, Poisson, ordinal, exponential, and normal models. In this paper, we propose a scan statistic for survival data which is based on a generalized life distribution model that provides three important life distributions, viz. Weibull, exponential, and Rayleigh. The proposed method is applied to the survival data of tuberculosis patients in Nainital district of Uttarakhand, India, for the year 2004–05. The Monte Carlo simulation studies reveal that the proposed method performs well for different survival distributions.

12.

ORDANOVA: Analysis of ordinal variation

Tamar Gadrich, Emil Bashkansky 《Journal of statistical planning and inference》2012

In order to accelerate object evaluation, some measurement systems commonly use an ordinal scale (e.g., stick results, quality estimation). This paper presents a way to analyze ordinal data variation. As in classical ANOVA for continuous data, ORDANOVA for ordinal data splits the total variation into within and between components. This decomposition has various practical applications such as classification, cluster analysis, distinguishing feature identification and so on.

13.

Clustering of Variables Around Latent Components

《Communications in Statistics - Simulation and Computation》2013,42(4):1131-1150


Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation shows high disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables taking account of external data. The strategy basically consists of performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion, which reflects the extent to which variables in each cluster are related to the latent variable associated with this cluster. Illustrations are outlined using real data sets from sensory studies.

14.

Research on a Weighted Cluster Analysis Method for Panel Data

Zhang Lijun, Peng Hao 《统计与信息论坛》 (Statistics and Information Forum) 2017,(4):21-26

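In research on cluster analysis methods for panel data, the Euclidean distance function is improved to reflect the fact that panel data have both a cross-sectional dimension and a time dimension. Indicator weights and time weights are taken into account in the clustering process, and a "weighted distance function" suited to panel data cluster analysis is proposed together with a corresponding Ward.D clustering method. First, a Euclidean distance function is defined that accounts for the absolute values of the indicators, their growth rates between adjacent time points, and their degree of fluctuation. Then, the indicator weights and time weights are aggregated through a linear model into a comprehensive weighted distance, which completes the weighted clustering of the panel data. The empirical results show that the weighted cluster analysis method, which accounts for indicator weights and time weights, has better discriminating power and improves the accuracy of sample clustering.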

15.

Better alternatives to current methods of scaling and weighting data for cluster analysis

R. Gnanadesikan, J.R. Kettenring, Srinivas Maloor 《Journal of statistical planning and inference》2007

Scaling of multivariate data prior to cluster analysis is important as a preprocessing step. Currently there are methods for doing this. This paper proposes some alternatives, which are particularly directed at helping reveal cluster structures in data. These methods are applied to simulated and real data sets and their performances are compared to some currently used methods. The results indicate that, in many situations, the new methods are much better than the most popular method, called autoscaling. In the most challenging clustering example considered, their performances, while poor, are no worse than all the currently used methods.
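For reference, the autoscaling baseline named in this abstract can be sketched as follows, assuming the common definition: each variable is centered at its mean and divided by its standard deviation. The use of the sample standard deviation, the function name, and the toy data are assumptions made for illustration; the alternatives proposed in the paper are not shown.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    centers = [mean(column) for column in columns]
    spreads = [stdev(column) for column in columns]  # assumes no constant column
    return [
        [(value - center) / spread
         for value, center, spread in zip(row, centers, spreads)]
        for row in rows
    ]


# Two variables on very different scales become directly comparable.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = autoscale(data)
```

A constant column has zero standard deviation and would raise a division error here, so such variables would need to be dropped or handled separately before scaling.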

16.

Comparing the methods of measuring multi-rater agreement on an ordinal rating scale: a simulation study with an application to real data

Y. Sertdemir, H. R. Burgut, Z. N. Alparslan, I. Unal, S. Gunasti 《Journal of applied statistics》2013,40(7):1506-1519

Agreement among raters is an important issue in medicine, as well as in education and psychology. The agreement between two raters on a nominal or ordinal rating scale has been investigated in many articles. The multi-rater case with normally distributed ratings has also been explored at length. However, there is a lack of research on multiple raters using an ordinal rating scale. In this simulation study, several methods for analyzing rater agreement were compared. The special case that was focused on was the multi-rater case using a bounded ordinal rating scale. The proposed methods for agreement were compared within different settings. Three main ordinal data simulation settings were used (normal, skewed and shifted data). In addition, the proposed methods were applied to a real data set from dermatology. The simulation results showed that Kendall's W and mean gamma highly overestimated the agreement in data sets with shifts in data. ICC₄ for bounded data should be avoided in agreement studies with rating scales < 5, where this method highly overestimated the simulated agreement. The difference in bias for all methods under study, except the mean gamma and Kendall's W, decreased as the rating scale increased. The bias of ICC₃ was consistent and small for nearly all simulation settings except the low agreement setting in the shifted data set. Researchers should be careful in selecting agreement methods, especially if shifts in ratings between raters exist, and may wish to apply more than one method before any conclusions are made.
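One of the agreement measures compared in this abstract, Kendall's W, can be sketched under a simplifying assumption: every rater supplies a complete ranking of the same items with no ties. Ratings on a short ordinal scale usually contain ties and call for a tie-corrected form, which is not shown. The function name and example rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings[r][i] is the rank (1..n) that rater r assigns to item i.
    """
    m = len(rankings)       # number of raters
    n = len(rankings[0])    # number of items
    if n < 2:
        raise ValueError("at least two items are needed")
    for ranks in rankings:
        if sorted(ranks) != list(range(1, n + 1)):
            raise ValueError("each rater must supply an untied ranking 1..n")
    rank_sums = [sum(ranks[i] for ranks in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # S: squared deviations of the item rank sums from their common mean.
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


full_agreement = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
opposed_raters = kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]])
```

W ranges from 0 (rank sums are all equal, so the raters show no common ordering) to 1 (all raters give identical rankings).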

17.

Model-based clustering of Gaussian copulas for mixed data

Matthieu Marbac, Christophe Biernacki, Vincent Vandewalle 《Communications in Statistics - Theory and Methods》2017,46(23):11635-11656

Clustering of mixed data is important yet challenging due to a shortage of conventional distributions for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Indeed, copulas, and Gaussian copulas in particular, are powerful tools for easily modeling the distribution of multivariate variables. This model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by considering the intra-component dependencies in a similar way to the Gaussian mixture. Each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables, and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As an interesting by-product, this model generalizes many well-known approaches and provides tools for visualization based on its parameters. The Bayesian inference is achieved with a Metropolis-within-Gibbs sampler. The numerical experiments, on simulated and real data, illustrate the benefits of the proposed model: flexible and meaningful parameterization combined with visualization features.

18.

Clustering objects on subsets of attributes (with discussion)

Jerome H. Friedman, Jacqueline J. Meulman 《Journal of the Royal Statistical Society. Series B, Statistical methodology》2004,66(4):815-849

Summary. A new procedure is proposed for clustering attribute value data. When used in conjunction with conventional distance-based clustering algorithms, this procedure encourages those algorithms to automatically detect subgroups of objects that preferentially cluster on subsets of the attribute variables rather than on all of them simultaneously. The relevant attribute subsets for each individual cluster can be different and partially (or completely) overlap with those of other clusters. Enhancements for increasing sensitivity for detecting especially low cardinality groups clustering on a small subset of variables are discussed. Applications in different domains, including gene expression arrays, are presented.

19.

Effects of some design factors on the distribution of similarity indices in cluster analysis

Ahmed N. Albatineh, Hafiz M. R. Khan, Bashar Zogheib, Golam B. M. Kibria 《Communications in Statistics - Simulation and Computation》2017,46(5):4018-4034

This article investigates the effects of number of clusters, cluster size, and correction for chance agreement on the distribution of two similarity indices, namely, the Jaccard and Rand indices. Skewness and kurtosis are calculated for the two indices and their corrected forms, then compared with those of the normal distribution. Three clustering algorithms are implemented: complete linkage, Ward, and K-means. Data were randomly generated from bivariate normal distributions with specified means and variance-covariance matrices. Three-way ANOVA is performed to assess the significance of the design factors using skewness and kurtosis of the indices as responses. Test statistics for testing skewness and kurtosis and observed power are calculated. Simulation results showed that, independent of the clustering algorithms or the similarity indices used, the interaction effect cluster size x number of clusters and the main effects of cluster size and number of clusters were always found to be significant for skewness and kurtosis. The three-way interaction of cluster size x correction x number of clusters was significant for skewness of Rand and Jaccard indices using all clustering algorithms, but was not significant using Ward's method for both Rand and Jaccard indices, while significant for Jaccard only using complete linkage and K-means algorithms. The correction for chance agreement was significant for skewness and kurtosis using Rand and Jaccard indices when the complete linkage method is used. Hence, such design factors must be taken into consideration when studying the distribution of such indices.
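The Jaccard index between two partitions can be sketched from pair counts. This assumes the usual pair-counting definition: the number of pairs joined in both partitions, divided by the number of pairs joined in at least one. The chance-corrected forms studied in the article are not shown, and the function names and example labels are hypothetical.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count pairs joined in both partitions, only in the first, only in the second."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        if together_a and together_b:
            both += 1
        elif together_a:
            only_a += 1
        elif together_b:
            only_b += 1
    return both, only_a, only_b


def jaccard_index(labels_a, labels_b):
    """Pairs joined in both partitions over pairs joined in at least one."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    joined_somewhere = both + only_a + only_b
    if joined_somewhere == 0:
        raise ValueError("undefined when neither partition joins any pair")
    return both / joined_somewhere


similarity = jaccard_index([0, 0, 1, 1], [0, 0, 1, 2])  # 1 shared pair of 2 joined pairs
```

Unlike the Rand index, this measure ignores pairs that are separated in both partitions, so it is not inflated when there are many small clusters.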

20.

Self-updating clustering algorithm for estimating the parameters in mixtures of von Mises distributions

Wen-Liang Hung, Shou-Jen Chang-Chien, Miin-Shen Yang 《Journal of applied statistics》2012,39(10):2259-2274

The EM algorithm is the standard method for estimating the parameters in finite mixture models. Yang and Pan [25] proposed a generalized classification maximum likelihood procedure, called the fuzzy c-directions (FCD) clustering algorithm, for estimating the parameters in mixtures of von Mises distributions. Two main drawbacks of the EM algorithm are its slow convergence and the dependence of the solution on the initial value used. The choice of initial values is of great importance in the algorithm-based literature as it can heavily influence the speed of convergence of the algorithm and its ability to locate the global maximum. On the other hand, the algorithmic frameworks of EM and FCD are closely related. Therefore, the drawbacks of FCD are the same as those of the EM algorithm. To resolve these problems, this paper proposes another clustering algorithm, which can self-organize local optimal cluster numbers without using cluster validity functions. The numerical results clearly indicate that the proposed algorithm is superior in performance to the EM and FCD algorithms. Finally, we apply the proposed algorithm to two real data sets.