Similar literature
20 similar articles found (search time: 31 ms)
1.
Model-based clustering is a method that clusters data under the assumption of a statistical model structure. In this paper, we propose a novel model-based hierarchical clustering method for a finite statistical mixture model based on the Fisher distribution. The main foci of the proposed method are to: (a) provide an efficient way to estimate the parameters of a Fisher mixture model (FMM); (b) generate a hierarchy of FMMs; and (c) select the optimal model. To this end, we develop a Bregman soft clustering method for FMM. Our model estimation strategy exploits Bregman divergence and hierarchical agglomerative clustering. Our model selection strategy, in turn, comprises a parsimony-based approach and an evaluation graph-based approach. We empirically validate the proposed method by applying it to simulated data. Next, we apply the method to real data to perform depth image analysis. We demonstrate that the proposed clustering method can be used as a potential tool for unsupervised depth image analysis.

2.
Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution which necessitate a revision of the model. The empirical evaluation of the proposed method on numerous real and simulated datasets shows that it is scalable in dimension and number of clusters, is robust to noisy and irrelevant features, and is capable of handling a variety of types of non-stationarity.

3.
Probabilistic Expert Systems for Forensic Inference from Genetic Markers   Total citations: 3 (self-citations: 0, citations by others: 3)
We present a number of real and fictitious examples in illustration of a new approach to analysing complex cases of forensic identification inference. This is effected by careful restructuring of the relevant pedigrees as a Probabilistic Expert System. Existing software can then be used to perform the required inferential calculations. Specific complications which are readily handled by this approach include missing data on one or more relevant individuals, and genetic mutation. The method is particularly valuable for disputed paternity cases, but applies also to certain criminal cases.

4.
This article focuses on the clustering problem based on Dirichlet process (DP) mixtures. Unlike other existing clustering methods, the proposed semi-parametric model captures both time-invariant and temporal patterns, and is flexible in that both the common and unique patterns are taken into account simultaneously. Furthermore, by jointly clustering subjects and the associated variables, the intrinsic complex shared patterns among subjects and among variables are expected to be captured. The number of clusters and the cluster assignments are inferred directly using the DP. Simulation studies illustrate the effectiveness of the proposed method. An application to wheal size data is discussed with the aim of identifying novel temporal patterns among allergens within subject clusters.

5.
A nonparametric test for the presence of clustering in survival data is proposed. Assuming a model that incorporates the clustering effect into the Cox Proportional Hazards model, simulation studies indicate that the procedure is correctly sized and powerful in a reasonably wide range of scenarios. The test for the presence of clustering over time is also robust to model misspecification. With a large number of clusters, the test is powerful even if the data are highly heterogeneous.

6.
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, as well as penalized likelihood and McLachlan's bootstrap method, are applied to two data sets and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.

7.
We consider Dirichlet process mixture models in which the observed clusters in any particular dataset are not viewed as belonging to a finite set of possible clusters but rather as representatives of a latent structure in which objects belong to one of a potentially infinite number of clusters. As more information is revealed, the number of inferred clusters is allowed to grow. The precision parameter of the Dirichlet process is a crucial parameter that controls the number of clusters. We develop a framework for the specification of the hyperparameters associated with the prior for the precision parameter that can be used in both the presence and the absence of subjective prior information about the level of clustering. Our approach is illustrated in an analysis of clustering brands at the magazine Which?. The results are compared with the approach of Dorazio (2009) via a simulation study.
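
The role of the precision parameter can be made concrete with a standard property of the Dirichlet process (not taken from this abstract): under the Chinese restaurant process representation, the prior expected number of clusters among n observations is the sum of alpha / (alpha + i) for i = 0, ..., n - 1. The sketch below is only an illustration of that formula; the function name `expected_num_clusters` is hypothetical and is not the hyperparameter framework the authors propose.

```python
def expected_num_clusters(alpha: float, n: int) -> float:
    """Prior expected number of clusters among n observations under a
    Dirichlet process with precision alpha (Chinese restaurant process).

    Observation i + 1 opens a new cluster with probability alpha / (alpha + i),
    so the expectation is the sum of these probabilities.
    """
    if alpha <= 0 or n < 1:
        raise ValueError("alpha must be positive and n must be at least 1")
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    # Larger precision values favour more clusters a priori.
    for alpha in (0.1, 1.0, 10.0):
        print(alpha, round(expected_num_clusters(alpha, 100), 2))
```

Because the expected count grows with alpha, a prior placed on alpha acts as an implicit prior on the level of clustering, which is why its hyperparameters need care.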

8.
This paper addresses the problem of identifying groups that satisfy specific conditions on the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising for identifying TCs.

9.
We incorporate a random clustering effect into the nonparametric version of the Cox Proportional Hazards model to characterize clustered survival data. The simulation studies provide evidence that clustered survival data can be better characterized through a nonparametric model. The predictive accuracy of the nonparametric model is affected by the number of clusters and the distribution of the random component accounting for the clustering effect. As the functional form of the covariate departs from linearity, the nonparametric model becomes increasingly advantageous over its parametric counterpart. Finally, the nonparametric model outperforms the parametric model when the data are highly heterogeneous and/or there is misspecification error.

10.
The forward search is a method of robust data analysis in which outlier-free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Application of the forward search to the proposed individual clusters leads to the establishment of cluster membership through the identification of non-cluster members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with eight clusters our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.

11.
This article proposes a new model for right‐censored survival data with multi‐level clustering based on the hierarchical Kendall copula model of Brechmann (2014) with Archimedean clusters. This model accommodates clusters of unequal size and multiple clustering levels, without imposing any structural conditions on the parameters or on the copulas used at various levels of the hierarchy. A step‐wise estimation procedure is proposed and shown to yield consistent and asymptotically Gaussian estimates under mild regularity conditions. The model fitting is based on multiple imputation, given that the censoring rate increases with the level of the hierarchy. To check the model assumption of Archimedean dependence, a goodness‐of‐fit test is developed. The finite‐sample performance of the proposed estimators and of the goodness‐of‐fit test is investigated through simulations. The new model is applied to data from the study of chronic granulomatous disease. The Canadian Journal of Statistics 47: 182–203; 2019 © 2019 Statistical Society of Canada

12.
Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, that is, objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for centers of clusters. Multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data and we apply this method to analyze gene expression profiles in a study on the plasticity of the dendritic cells.

13.
Clustering gene expression data is an important step in providing information to biologists. A Bayesian clustering procedure using Fourier series with a Dirichlet process prior for clusters was developed. As an optimal computational tool for this Bayesian approach, Gibbs sampling of a normal mixture with a Dirichlet process was implemented to calculate the posterior probabilities when the number of clusters was unknown. Monte Carlo study results showed that the model was useful for clustering. The proposed method was applied to the budding yeast Saccharomyces cerevisiae and provided biologically interpretable results.

14.
Model-based clustering for social networks   Total citations: 5 (self-citations: 0, citations by others: 5)
Summary.  Network models are widely used to represent relations between interacting units or actors. Network data often exhibit transitivity, meaning that two actors that have ties to a third actor are more likely to be tied than actors that do not, homophily by attributes of the actors or dyads, and clustering. Interest often focuses on finding clusters of actors or ties, and the number of groups in the data is typically unknown. We propose a new model, the latent position cluster model, under which the probability of a tie between two actors depends on the distance between them in an unobserved Euclidean 'social space', and the actors' locations in the latent social space arise from a mixture of distributions, each corresponding to a cluster. We propose two estimation methods: a two-stage maximum likelihood method and a fully Bayesian method that uses Markov chain Monte Carlo sampling. The former is quicker and simpler, but the latter performs better. We also propose a Bayesian way of determining the number of clusters that are present by using approximate conditional Bayes factors. Our model represents transitivity, homophily by attributes and clustering simultaneously and does not require the number of clusters to be known. The model makes it easy to simulate realistic networks with clustering, which are potentially useful as inputs to models of more complex systems of which the network is part, such as epidemic models of infectious disease. We apply the model to two networks of social relations. A free software package in the R statistical language, latentnet, is available to analyse data by using the model.

15.
Detecting local spatial clusters for count data is an important task in spatial epidemiology. Two broad approaches, moving window and disease mapping methods, have been suggested in some of the literature to find clusters. However, the existing methods employ somewhat arbitrarily chosen tuning parameters, and the local clustering results are sensitive to the choices. In this paper, we propose a penalized likelihood method to overcome the limitations of existing local spatial clustering approaches for count data. We start with a Poisson regression model to accommodate any type of covariates, and formulate the clustering problem as a penalized likelihood estimation problem to find change points of intercepts in two-dimensional space. The cost of developing a new algorithm is minimized by modifying an existing least absolute shrinkage and selection operator algorithm. The computational details on the modifications are shown, and the proposed method is illustrated with Seoul tuberculosis data.

16.
This paper gives a comparative study of the K-means algorithm and the mixture model (MM) method for clustering normal data. The EM algorithm is used to compute the maximum likelihood estimators (MLEs) of the parameters of the MM. These parameters include mixing proportions, which may be thought of as the prior probabilities of different clusters; the maximum posterior (Bayes) rule is used for clustering. Hence, asymptotically the MM method approaches the Bayes rule for known parameters, which is optimal in terms of minimizing the expected misclassification rate (EMCR).
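
The difference between the two assignment rules can be illustrated with a minimal sketch, assuming a one-dimensional normal mixture whose parameters are already known (the EM fitting step is omitted). The function names and the numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def map_assign(x, weights, means, sds):
    """Maximum posterior (Bayes) rule: pick the component k that maximizes
    weight_k * density_k(x), which is proportional to the posterior."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(scores)), key=scores.__getitem__)


def nearest_center_assign(x, centers):
    """K-means style rule: pick the closest center, ignoring mixing
    proportions and variances."""
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))


if __name__ == "__main__":
    # Illustrative values: a dominant cluster at 0 and a small one at 3.
    weights, means, sds = (0.9, 0.1), (0.0, 3.0), (1.0, 1.0)
    x = 1.8
    print("nearest center:", nearest_center_assign(x, means))
    print("maximum posterior:", map_assign(x, weights, means, sds))
```

With unequal mixing proportions the two rules can disagree for points between the centers, since the posterior rule shifts the boundary toward the smaller cluster; with equal proportions and equal variances they coincide.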

17.
The self-updating process (SUP) is a clustering algorithm that takes the viewpoint of the data points and simulates how they move and perform self-clustering. It is an iterative process on the sample space and allows for both time-varying and time-invariant operators. By simulations and comparisons, this paper shows that SUP is particularly competitive in clustering (i) data with noise, (ii) data with a large number of clusters, and (iii) unbalanced data. When noise is present in the data, SUP is able to isolate the noise data points while performing clustering simultaneously. The property of local updating enables SUP to handle data with a large number of clusters and data of various structures. In this paper, we show that the blurring mean-shift is a static SUP. Therefore, our discussions on the strengths of SUP also apply to the blurring mean-shift.

18.

Kaufman and Rousseeuw (1990) proposed a clustering algorithm, Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common context that many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM does have problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing a criterion, "Average Silhouette", defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of "Average Silhouette". We implement these two new partitioning around medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
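
For reference, the average silhouette of a given partition can be computed directly from a distance matrix. The following is a minimal sketch of the standard silhouette definition only; it is not the authors' maximization algorithm or their fast approximation, and the function name `average_silhouette` is hypothetical.

```python
def average_silhouette(dist, labels):
    """Average silhouette width of a partition.

    dist   : square list of lists holding pairwise distances
    labels : cluster label of each object, in the same order as dist
    Objects in singleton clusters contribute a silhouette of 0.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton: silhouette taken to be 0
        # a: mean distance to the other members of the object's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]  # illustrative one-dimensional data
    dist = [[abs(p - q) for q in points] for p in points]
    print(average_silhouette(dist, [0, 0, 1, 1]))  # well-separated partition
    print(average_silhouette(dist, [0, 1, 0, 1]))  # poor partition
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate objects that sit closer to another cluster than to their own, which is what makes the quantity usable as a criterion to maximize.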

19.
Reduced k‐means clustering is a method for clustering objects in a low‐dimensional subspace. The advantage of this method is that both the clustering of objects and the low‐dimensional subspace reflecting the cluster structure are obtained simultaneously. In this paper, the relationship between conventional k‐means clustering and reduced k‐means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced k‐means clustering as the sample size increases unboundedly are presented. The results for a more general model covering conventional k‐means clustering and reduced k‐means clustering are provided in this paper. Moreover, a consistent selection of the numbers of clusters and dimensions is described.

20.
Block clustering with collapsed latent block models   Total citations: 1 (self-citations: 0, citations by others: 1)
We introduce a Bayesian extension of the latent block model for model-based block clustering of data matrices. Our approach considers a block model where block parameters may be integrated out. The result is a posterior defined over the number of clusters in rows and columns and cluster memberships. The number of row and column clusters need not be known in advance as these are sampled along with cluster memberships using Markov chain Monte Carlo. This differs from existing work on latent block models, where the number of clusters is assumed known or is chosen using an information criterion. We analyze both simulated and real data to validate the technique.
