Related Articles

20 related articles found.
1.
Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, i.e. objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for cluster centers; multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data, and we apply the method to analyze gene expression profiles in a study on the plasticity of dendritic cells.
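
A concrete picture of the two-step scheme (candidate centres, local search, score, remove, repeat) may help. The following is a minimal Python sketch of the general idea only, not the authors' algorithm: the ball-shaped local search, the size-based score, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def sequential_clusters(X, radius=1.0, n_candidates=10, min_size=5, max_clusters=10):
    """Toy sequential clustering: repeatedly pick the best ball-shaped cluster
    around a candidate centre and remove it; points never covered stay
    'sporadic' (label -1), i.e. they are not assigned to any cluster."""
    rng = np.random.default_rng(0)
    remaining = np.arange(len(X))
    labels = np.full(len(X), -1)
    for k in range(max_clusters):
        if len(remaining) < min_size:
            break
        # Step 1: draw several candidate centres from the remaining objects.
        cand = rng.choice(remaining, size=min(n_candidates, len(remaining)), replace=False)
        best_members, best_score = None, -np.inf
        for c in cand:
            # Step 2: local search -- objects within `radius` of the candidate centre.
            d = np.linalg.norm(X[remaining] - X[c], axis=1)
            members = remaining[d <= radius]
            # Assumed score: reward cluster size, penalize spread around the centre.
            score = len(members) - d[d <= radius].mean() if len(members) >= min_size else -np.inf
            if score > best_score:
                best_score, best_members = score, members
        if best_members is None:
            break
        labels[best_members] = k            # keep the best cluster, then remove it from the data
        remaining = np.setdiff1d(remaining, best_members)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(40, 2)) for m in (0, 3, 6)])
print(np.bincount(sequential_clusters(X) + 1))   # counts: sporadic objects first, then the clusters
```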

2.

Kaufman and Rousseeuw (1990) proposed a clustering algorithm, Partitioning Around Medoids (PAM), which maps a distance matrix into a specified number of clusters. A particularly nice property is that PAM allows clustering with respect to any specified distance metric. In addition, the medoids are robust representations of the cluster centers, which is particularly important in the common setting where many elements do not belong well to any cluster. Based on our experience in clustering gene expression data, we have noticed that PAM has problems recognizing relatively small clusters in situations where good partitions around medoids clearly exist. In this paper, we propose to partition around medoids by maximizing the "Average Silhouette" criterion defined by Kaufman and Rousseeuw (1990). We also propose a fast-to-compute approximation of the Average Silhouette. We implement these two new partitioning-around-medoids algorithms and illustrate their performance relative to existing partitioning methods in simulations.
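
To make the Average Silhouette criterion concrete, here is a small Python sketch that computes it directly from a distance matrix and a candidate partition; the partitions themselves would come from PAM or any other routine, and the toy data and helper name are illustrative assumptions, not the paper's code.

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette width of a partition, given a full distance matrix D.
    For each object i: a(i) = mean distance to its own cluster, b(i) = smallest
    mean distance to any other cluster, s(i) = (b - a) / max(a, b)."""
    labels = np.asarray(labels)
    sil = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():                   # singleton cluster: s(i) = 0 by convention
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        sil[i] = (b - a) / max(a, b)
    return sil.mean()

# Compare two candidate partitions of the same toy data set.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
good = np.repeat([0, 1], 30)
bad = np.arange(60) % 2
print(average_silhouette(D, good), average_silhouette(D, bad))   # the correct split scores higher
```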

3.
Clustering gene expression data is an important step in providing information to biologists. A Bayesian clustering procedure using Fourier series with a Dirichlet process prior for clusters was developed. As an optimal computational tool for this Bayesian approach, Gibbs sampling of a normal mixture with a Dirichlet process was implemented to calculate the posterior probabilities when the number of clusters was unknown. Monte Carlo study results showed that the model was useful for suitable clustering. The proposed method was applied to the budding yeast Saccharomyces cerevisiae and provided biologically interpretable results.

4.
Health technology assessment often requires the evaluation of interventions which are implemented at the level of the health service organization unit (e.g. GP practice) for clusters of individuals. In a cluster randomized controlled trial (cRCT), clusters of patients are randomized rather than individual patients.

The majority of statistical analyses in individually randomized RCTs assume that the outcomes of different patients are independent. In cRCTs this assumption is doubtful, as the outcomes of patients in the same cluster may be correlated. Hence, the analysis of data from cRCTs presents a number of difficulties. The aim of this paper is to describe the statistical methods for adjusting for clustering in the context of cRCTs.

There are essentially four approaches to analysing cRCTs:

1. Cluster-level analysis using aggregate summary data.

2. Regression analysis with robust standard errors.

3. Random-effects/cluster-specific approach.

4. Marginal/population-averaged approach.

This paper compares and contrasts the four approaches using example data, with binary and continuous outcomes, from a cRCT designed to evaluate the effectiveness of training Health Visitors in psychological approaches to identify post-natal depressive symptoms and support post-natal women, compared with usual care. The PoNDER Trial randomized 101 clusters (GP practices) and collected data on 2659 new mothers with an 18-month follow-up.
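
To illustrate the first two approaches on data of this general shape, here is a minimal Python sketch on simulated data: approach 1 as a t-test on cluster-level means, and approach 2 as an individual-level regression with cluster-robust (sandwich) standard errors via statsmodels. All data and variable names are invented for illustration; the PoNDER data are not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Simulate a small cRCT: 40 clusters, 25 patients each, with a cluster-level random effect.
rng = np.random.default_rng(42)
n_clusters, n_per = 40, 25
cluster = np.repeat(np.arange(n_clusters), n_per)
treat = np.repeat(rng.integers(0, 2, n_clusters), n_per)      # randomized at cluster level
u = np.repeat(rng.normal(0, 0.5, n_clusters), n_per)          # cluster effect -> within-cluster correlation
y = 1.0 + 0.3 * treat + u + rng.normal(0, 1.0, n_clusters * n_per)
df = pd.DataFrame({"y": y, "treat": treat, "cluster": cluster})

# Approach 1: cluster-level analysis -- compare cluster means between arms.
means = df.groupby("cluster").agg(y=("y", "mean"), treat=("treat", "first"))
t, p = stats.ttest_ind(means.loc[means.treat == 1, "y"], means.loc[means.treat == 0, "y"])
print(f"cluster-level t-test: t = {t:.2f}, p = {p:.3f}")

# Approach 2: individual-level regression with cluster-robust (sandwich) standard errors.
Xmat = sm.add_constant(df["treat"])
fit = sm.OLS(df["y"], Xmat).fit(cov_type="cluster", cov_kwds={"groups": df["cluster"]})
print(fit.summary().tables[1])
```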

5.
The article by Müller, Quintana, and Page reviews a variety of Bayesian nonparametric models and demonstrates them in a few applications. They emphasize applications to spatial data, on which our discussion focuses as well. In particular, we consider two types of mixture models based on species sampling models (SSM) for spatial clustering and apply them to the Chilean mathematics testing score data analyzed by the authors. We conclude that only the SSM mixture model that includes the spatial locations as part of the observations yields spatially non-overlapping clusters.

6.
Detecting local spatial clusters for count data is an important task in spatial epidemiology. Two broad approaches—moving window and disease mapping methods—have been suggested in the literature for finding clusters. However, the existing methods employ somewhat arbitrarily chosen tuning parameters, and the local clustering results are sensitive to these choices. In this paper, we propose a penalized likelihood method to overcome the limitations of existing local spatial clustering approaches for count data. We start with a Poisson regression model to accommodate any type of covariates, and formulate the clustering problem as a penalized likelihood estimation problem to find change points of intercepts in two-dimensional space. The cost of developing a new algorithm is minimized by modifying an existing least absolute shrinkage and selection operator algorithm. The computational details of the modifications are shown, and the proposed method is illustrated with Seoul tuberculosis data.

7.
Model-based clustering for social networks
Network models are widely used to represent relations between interacting units or actors. Network data often exhibit transitivity, meaning that two actors that have ties to a third actor are more likely to be tied than actors that do not, homophily by attributes of the actors or dyads, and clustering. Interest often focuses on finding clusters of actors or ties, and the number of groups in the data is typically unknown. We propose a new model, the latent position cluster model, under which the probability of a tie between two actors depends on the distance between them in an unobserved Euclidean 'social space', and the actors' locations in the latent social space arise from a mixture of distributions, each corresponding to a cluster. We propose two estimation methods: a two-stage maximum likelihood method and a fully Bayesian method that uses Markov chain Monte Carlo sampling. The former is quicker and simpler, but the latter performs better. We also propose a Bayesian way of determining the number of clusters that are present by using approximate conditional Bayes factors. Our model represents transitivity, homophily by attributes and clustering simultaneously, and does not require the number of clusters to be known. The model makes it easy to simulate realistic networks with clustering, which are potentially useful as inputs to models of more complex systems of which the network is part, such as epidemic models of infectious disease. We apply the model to two networks of social relations. A free software package in the R statistical language, latentnet, is available to analyse data using the model.

8.

A nonparametric procedure is proposed to estimate multiple change-points of location changes in a univariate data sequence by using ranks instead of the raw data. While existing rank-based multiple change-point detection methods are mostly based on sequential tests, we treat the problem as one of model selection. We derive the corresponding Schwarz information criterion for rank statistics, theoretically prove the consistency of the change-point estimator, and use a pruned dynamic programming algorithm to compute the change-point estimator. Simulation studies show our method's robustness, effectiveness and efficiency in detecting mean changes. We also apply the method to a gene dataset as an illustration.
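
The model-selection idea can be illustrated with a simple (unpruned) dynamic program: replace the data by their ranks, find for each number of change-points the segmentation minimizing the within-segment sum of squares of the ranks, and pick the number of change-points with a Schwarz-type penalty. The sketch below is a generic illustration of that idea with an assumed penalty form, not the paper's exact criterion or its pruned algorithm.

```python
import numpy as np
from scipy.stats import rankdata

def rank_sic_changepoints(x, max_cp=5):
    """Estimate change-points in location from the ranks of the data: for each
    candidate number of change-points k, find the segmentation minimizing the
    within-segment sum of squares of the ranks (dynamic programming), then
    choose k with a Schwarz-type (SIC) penalty."""
    r = rankdata(x)
    n = len(r)
    csum = np.concatenate([[0.0], np.cumsum(r)])
    csum2 = np.concatenate([[0.0], np.cumsum(r ** 2)])

    def seg_cost(i, j):                      # sum of squares of ranks on segment r[i:j]
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    cost = np.full((max_cp + 1, n + 1), np.inf)   # cost[k][j]: best cost of r[:j] with k change-points
    back = np.zeros((max_cp + 1, n + 1), dtype=int)
    cost[0] = [seg_cost(0, j) if j > 0 else 0.0 for j in range(n + 1)]
    for k in range(1, max_cp + 1):
        for j in range(k + 1, n + 1):
            cands = [cost[k - 1][t] + seg_cost(t, j) for t in range(k, j)]
            back[k][j] = k + int(np.argmin(cands))
            cost[k][j] = min(cands)

    # Assumed SIC-type criterion: n*log(RSS/n) plus log(n) per change-point.
    sic = [n * np.log(cost[k][n] / n) + k * np.log(n) for k in range(max_cp + 1)]
    k_best = int(np.argmin(sic))
    cps, j = [], n                           # backtrack the chosen segmentation
    for k in range(k_best, 0, -1):
        j = back[k][j]
        cps.append(j)
    return sorted(cps)

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(rank_sic_changepoints(x))              # expect a single change-point near index 100
```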

9.
The forward search is a method of robust data analysis in which outlier-free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tentative clusters. Application of the forward search to the proposed individual clusters leads to the establishment of cluster membership through the identification of non-cluster members as outlying. The method requires no prior information on the number of clusters and does not seek to classify all observations. These properties are illustrated by the analysis of 200 six-dimensional observations on Swiss banknotes. The importance of linked plots and brushing in elucidating data structures is illustrated. We also provide an automatic method for determining cluster centres and compare the behaviour of our method with model-based clustering. In a simulated example with eight clusters, our method provides more stable and accurate solutions than model-based clustering. We consider the computational requirements of both procedures.

10.

Recently, Bayesian nonparametric approaches in survival studies have attracted much attention. Because of multimodality in survival data, mixture models are very common. We introduce a Bayesian nonparametric mixture model with the Burr distribution (Burr type XII) as the kernel. Since the Burr distribution shares good properties of distributions commonly used in survival analysis, it has more flexibility than other distributions. By applying this model to simulated and real failure-time datasets, we show the advantages of this model and compare it with Dirichlet process mixture models with different kernels. Markov chain Monte Carlo (MCMC) simulation methods are used to calculate the posterior distribution.

11.

Joint models are statistical tools for estimating the association between time-to-event and longitudinal outcomes. One challenge to the application of joint models is their computational complexity. Common estimation methods for joint models include two-stage, Bayesian, and maximum-likelihood methods. In this work, we consider joint models of a time-to-event outcome and multiple longitudinal processes and develop a maximum-likelihood estimation method using the expectation–maximization algorithm. We assess the performance of the proposed method via simulations and apply the methodology to a data set to determine the association between longitudinal systolic and diastolic blood pressure measures and time to coronary artery disease.

12.
Statistical inference in the wavelet domain remains a vibrant area of contemporary statistical research because of the desirable properties of wavelet representations and the need of the scientific community to process, explore, and summarize massive data sets. Prime examples are biomedical, geophysical, and internet-related data. We propose two new approaches to wavelet shrinkage/thresholding.

In the spirit of Efron and Tibshirani's recent work on the local false discovery rate, we propose the Bayesian Local False Discovery Rate (BLFDR), in which the underlying model on wavelet coefficients does not assume known variances. This approach to wavelet shrinkage is shown to be connected with shrinkage based on Bayes factors. The second proposal, the Bayesian False Discovery Rate (BaFDR), is based on ordering the posterior probabilities that the true wavelet coefficients are null, in Bayesian testing of multiple hypotheses.

We demonstrate that both approaches result in competitive shrinkage methods by contrasting them with some popular shrinkage techniques.

13.

This paper considers adaptation of hierarchical models for small area disease counts to detect disease clustering. A high risk area may be an outlier (in local terms) if surrounded by low risk areas, whereas a high risk cluster requires that both the focus area and surrounding areas demonstrate common elevated risk. A local join count method is suggested to detect local clustering of high disease risk in a single health outcome, and extends to assessing bivariate spatial clustering in relative risk. Applications include assessing spatial heterogeneity in effects of area predictors according to local clustering configuration, and gauging sensitivity of bivariate clustering to random effect assumptions.
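
The local join count statistic underlying this approach has a simple form: with a binary high-risk indicator x and a binary spatial weights matrix W, the "black–black" count at area i is x_i Σ_j w_ij x_j, the number of high-risk neighbours of a high-risk area. A toy numpy sketch on an invented lattice (not the paper's data or its full hierarchical model):

```python
import numpy as np

def local_join_count(x, W):
    """Local 'black-black' join count: for a binary high-risk indicator x (0/1)
    and a binary spatial weights matrix W, BB_i = x_i * sum_j W_ij * x_j,
    i.e. the number of high-risk neighbours of a high-risk area."""
    x = np.asarray(x)
    return x * (W @ x)

# Toy example: 4 areas on a line, areas 0-2 high risk, area 3 low risk.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])               # simple adjacency (rook-style) on a line
x = np.array([1, 1, 1, 0])
print(local_join_count(x, W))              # [1, 2, 1, 0]: area 1 sits in the core of the cluster
```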

14.

Cluster analysis is the assignment of objects into different groups, or more precisely the partitioning of a data set into subsets (clusters), so that the data in each subset share some common trait according to some distance measure. Unlike classification, in clustering one has to first decide the optimum number of clusters and then assign the objects to the clusters. Solving such problems for a large number of high-dimensional data points is quite complicated, and most existing algorithms will not perform properly. In the present work a new clustering technique applicable to large data sets has been used to cluster the spectra of 702,248 galaxies and quasars, each having 1,540 points in the wavelength range imposed by the instrument. The proposed technique successfully discovered five clusters from this 702,248 × 1,540 data matrix.

15.
Spectral clustering uses eigenvectors of the Laplacian of the similarity matrix. It is well suited to binary clustering problems. When applied to multi-way clustering, either binary spectral clustering is applied recursively, or an embedding into spectral space is performed and some other method, such as K-means clustering, is used to cluster the points. Here we propose and study a K-way clustering algorithm—spectral modular transformation—based on the fact that the graph Laplacian has an equivalent representation with a diagonal modular structure. The method first transforms the original similarity matrix into a new one, which is nearly disconnected and reveals the cluster structure clearly; we then apply a linearized cluster assignment algorithm to split the clusters. In this way, we can find some samples for each cluster recursively using a divide-and-conquer strategy. To get the overall clustering results, we apply the cluster assignment obtained in the previous step as the initialization of a multiplicative update method for spectral clustering. Examples show that our method outperforms spectral clustering using other initializations.
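
The spectral modular transformation itself is not reproduced here; as background, the sketch below shows the standard K-way baseline such methods are set against: embed the data with the bottom eigenvectors of the normalized Laplacian of the similarity matrix, then run K-means in the spectral space. The toy data and parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k):
    """Standard k-way spectral clustering: embed with the k smallest eigenvectors
    of the symmetric normalized Laplacian of the similarity matrix S, then run
    k-means in the spectral space."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt       # L_sym = I - D^{-1/2} S D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                                      # k smallest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)        # row-normalize (Ng-Jordan-Weiss style)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Toy similarity matrix built from two Gaussian blobs with an RBF kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-dist2 / 2.0)
print(np.bincount(spectral_clustering(S, 2)))               # two clusters of 30 points each
```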

16.
Shi, Yushu; Laud, Purushottam; Neuner, Joan. Lifetime Data Analysis, 2021, 27(1): 156–176

In this paper, we first propose a dependent Dirichlet process (DDP) model using a mixture of Weibull models with each mixture component resembling a Cox model for survival data. We then build a Dirichlet process mixture model for competing risks data without regression covariates. Next we extend this model to a DDP model for competing risks regression data by using a multiplicative covariate effect on subdistribution hazards in the mixture components. Though built on proportional hazards (or subdistribution hazards) models, the proposed nonparametric Bayesian regression models do not require the assumption of constant hazard (or subdistribution hazard) ratio. An external time-dependent covariate is also considered in the survival model. After describing the model, we discuss how both cause-specific and subdistribution hazard ratios can be estimated from the same nonparametric Bayesian model for competing risks regression. For use with the regression models proposed, we introduce an omnibus prior that is suitable when little external information is available about covariate effects. Finally we compare the models’ performance with existing methods through simulations. We also illustrate the proposed competing risks regression model with data from a breast cancer study. An R package “DPWeibull” implementing all of the proposed methods is available at CRAN.


17.

In a changing climate, changes in the timing of seasonal events such as floods and flowering should be assessed using circular methods. Six different methods for clustering on a circle and one linear method are compared across different locations, spreads, and sample sizes. Best results are obtained when clusters are well separated and the number of observations in each cluster is approximately equal. Simulations of flood-like distributions are used to assess and explore the clustering methods. Generally, k-means provides results that are close to those expected; some other methods perform well under specific conditions, but no single method stands out as best.
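
A standard device for applying k-means to circular data such as flood timing is to map each angle to a point on the unit circle and cluster in that two-dimensional embedding. The sketch below illustrates this device on invented data; it is not the paper's simulation design.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_circular(angles_rad, k):
    """Cluster angles (radians) by embedding them on the unit circle, so that
    0 and 2*pi are treated as the same direction, then running ordinary k-means."""
    points = np.column_stack([np.cos(angles_rad), np.sin(angles_rad)])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)

# Toy timing data: spring and autumn "flood seasons" expressed as day-of-year angles.
rng = np.random.default_rng(0)
days = np.concatenate([rng.normal(100, 10, 50), rng.normal(280, 10, 50)]) % 365
angles = 2 * np.pi * days / 365
print(np.bincount(kmeans_circular(angles, 2)))   # two seasonal clusters of 50 observations each
```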

18.
The Bayesian approach to inference stands out for naturally allowing the borrowing of information across heterogeneous populations, with different samples possibly sharing the same distribution. A popular Bayesian nonparametric model for clustering probability distributions is the nested Dirichlet process, which however has the drawback of grouping distributions into a single cluster when ties are observed across samples. With the goal of achieving a flexible and effective clustering method for both samples and observations, we investigate a nonparametric prior that arises as the composition of two different discrete random structures and derive a closed-form expression for the induced distribution of the random partition, the fundamental tool regulating the clustering behavior of the model. On the one hand, this allows us to gain deeper insight into the theoretical properties of the model; on the other hand, it yields an MCMC algorithm for evaluating Bayesian inferences of interest. Moreover, we single out limitations of this algorithm when working with more than two populations and, consequently, devise an alternative, more efficient sampling scheme which, as a by-product, allows testing of homogeneity between different populations. Finally, we perform a comparison with the nested Dirichlet process and provide illustrative examples on both synthetic and real data.

19.
Bandwidth plays an important role in determining the performance of nonparametric estimators, such as the local constant estimator. In this article, we propose a Bayesian approach to bandwidth estimation for local constant estimators of time-varying coefficients in time series models. We establish a large sample theory for the proposed bandwidth estimator and Bayesian estimators of the unknown parameters involved in the error density. A Monte Carlo simulation study shows that (i) the proposed Bayesian estimators for the bandwidth and the parameters in the error density have satisfactory finite sample performance; and (ii) our proposed Bayesian approach achieves better performance in estimating the bandwidths than the normal reference rule and cross-validation. Moreover, we apply our proposed Bayesian bandwidth estimation method to time-varying coefficient models that explain Okun's law and the relationship between consumption growth and income growth in the U.S. For each model, we also provide calibrated parametric forms of the time-varying coefficients. Supplementary materials for this article are available online.
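
The Bayesian bandwidth estimator itself is not reproduced here; for context, the sketch below implements the local constant (Nadaraya–Watson) estimator together with a leave-one-out cross-validation bandwidth choice of the kind the paper uses as a benchmark. The data, kernel, and bandwidth grid are illustrative assumptions.

```python
import numpy as np

def local_constant(x0, x, y, h):
    """Nadaraya-Watson (local constant) estimate of E[y | x = x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return (w * y).sum() / w.sum()

def cv_bandwidth(x, y, grid):
    """Leave-one-out cross-validation bandwidth: minimize the prediction error
    when each observation is estimated from all the others."""
    def cv_score(h):
        errs = [(y[i] - local_constant(x[i], np.delete(x, i), np.delete(y, i), h)) ** 2
                for i in range(len(x))]
        return np.mean(errs)
    return min(grid, key=cv_score)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)
h_cv = cv_bandwidth(x, y, np.linspace(0.05, 1.0, 20))
print(f"CV bandwidth: {h_cv:.2f}, fit at pi: {local_constant(np.pi, x, y, h_cv):.2f}")  # true value is 0
```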

20.
A network cluster is defined as a set of nodes with 'strong' within-group ties and 'weak' between-group ties. Most clustering methods focus on finding groups of 'densely connected' nodes, where the dyad (or tie between two nodes) serves as the building block for forming clusters. However, since the unweighted dyad cannot distinguish strong relationships from weak ones, it seems reasonable to consider an alternative building block involving more than two nodes. In the simplest case, one can consider the triad (or three nodes), where the fully connected triad represents the basic unit of transitivity in an undirected network. In this work we propose a clustering framework for finding highly transitive subgraphs in an undirected, unweighted network, where the fully connected triad (or triangle configuration) is used as the building block for forming clusters. We apply our methodology to four real networks with encouraging results. Monte Carlo simulation results suggest that, on average, the proposed method yields good clustering performance on synthetic benchmark graphs relative to other popular methods.
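
The authors' framework is not reproduced here; a well-known related approach that also builds clusters from fully connected triads is k-clique percolation with k = 3 (triangles sharing an edge are merged), which is available in networkx. The toy graph below is invented for illustration.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Two triangle-dense groups joined by a single weak (bridge) tie.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2), (2, 3), (3, 0), (3, 4), (4, 0)])   # first transitive group
G.add_edges_from([(5, 6), (6, 7), (5, 7), (7, 8), (8, 5)])                   # second transitive group
G.add_edge(4, 5)                                                             # bridge tie, part of no triangle

# k-clique percolation with k = 3: clusters are unions of triangles that share an edge,
# so the unweighted bridge dyad alone cannot merge the two groups.
clusters = [sorted(c) for c in k_clique_communities(G, 3)]
print(clusters)   # two communities: [0, 1, 2, 3, 4] and [5, 6, 7, 8]
```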
