Similar articles
1.
Communications in Statistics: Theory and Methods, 2012, 41(16-17): 3211-3232
The analysis of microarray data is a widespread functional genomics approach that allows for the monitoring of the expression of thousands of genes at once. The analysis of the great amount of data generated in a microarray experiment requires powerful statistical techniques. One of the first tasks of the analysis of microarray data is to cluster data into biologically meaningful groups according to their expression patterns. In this article, we discuss classical as well as recent clustering techniques for microarray data. We pay particular attention to both theoretical and practical issues and give some general indications that might be useful to practitioners.

2.
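Projection pursuit clustering based on a genetic algorithm (total citations: 2, self-citations: 0, citations by others: 2)
The traditional projection pursuit clustering algorithm PROCLUS is effective for clustering high-dimensional data, but it relies on hill climbing to iterate over the center points of each cluster and select the best ones. Because hill climbing is a local search method, the optimum it finds may be only a local optimum. To address this defect, an improved projection pursuit clustering algorithm is proposed that uses a genetic algorithm to iterate over the cluster center points in search of the global optimum. Simulation results demonstrate the feasibility and effectiveness of the new algorithm.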

3.
We propose two probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample such as that obtained from the partitioning around medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition using these measures. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft-clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher’s classic dataset on irises.
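
For orientation, the classic silhouette width that this abstract takes as its starting point can be computed from a dissimilarity matrix and a hard partition as sketched below. This is only the standard silhouette, not the probability-like measures the authors propose (the abstract does not define them); the function name silhouette_widths and its argument names are hypothetical.

```python
def silhouette_widths(dist, labels):
    """Classic silhouette width s(i) for each observation.

    dist   : symmetric list-of-lists of pairwise dissimilarities
    labels : hard cluster label of each observation
    (Hypothetical helper for illustration only.)
    """
    n = len(labels)
    clusters = set(labels)
    widths = []
    for i in range(n):
        # Mean dissimilarity from i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        others = [v for c, v in mean_to.items() if c != own]
        if own not in mean_to or not others:
            # Singleton cluster or a single cluster overall: use 0 by convention.
            widths.append(0.0)
            continue
        a = mean_to[own]   # cohesion: mean dissimilarity within own cluster
        b = min(others)    # separation: nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths
```

Values near 1 indicate an observation that sits well inside its cluster, values near 0 an ambiguous one, and negative values a likely misassignment. The silhouette is not a probability, which is the limitation the abstract says its measures address.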

4.
5.
Mixtures of factor analyzers is a useful model-based clustering method which can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive to both diverse non-normalities of marginal variables and outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for clustering high-dimensional data. This model has two advantages: (1) it allows different marginal distributions, which makes the mixture model more flexible to fit, and (2) it can avoid the curse of dimensionality by embedding the factor-analytic structure in the component-correlation matrices of the mixture distribution. An EM algorithm is developed for fitting MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution which fits best to the data. It is applied to both synthetic data and microarray gene expression data for clustering and shows better performance than several existing methods.

6.
We consider n individuals described by p variables, represented by points on the surface of the unit hypersphere. We suppose that the individuals are fixed and that the set of variables comes from a mixture of bipolar Watson distributions. To identify the mixture, we use the EM and dynamic clusters algorithms, which yield a partition of the set of variables into clusters of variables. Our aim is to evaluate the clusters obtained by these algorithms, using measures of within-groups and between-groups variability, and to compare them with those obtained by other clustering approaches, by analyzing simulated and real data.

7.
In this article, using longitudinal data, we develop the theory of credibility using a copula model. A convex combination of copulas is used to describe the dependencies among claims. Finally, for comparison with the results of a single copula, some simulations of Massachusetts automobile claims based on the EM algorithm are presented.

8.
Hidden semi-Markov models (HSMMs) were introduced to overcome the constraint of a geometric sojourn time distribution for the different hidden states in classical hidden Markov models. Several variations of HSMMs have been proposed that model the sojourn times by a parametric or a nonparametric family of distributions. In this article, we concentrate on the nonparametric case where the duration distributions are attached to transitions and not to states, as in most of the published papers on HSMMs. It is therefore worth noting that here we treat the underlying hidden semi-Markov chain in its general probabilistic structure. In that case, Barbu and Limnios (2008, Semi-Markov Chains and Hidden Semi-Markov Models Toward Applications: Their Use in Reliability and DNA Analysis, New York: Springer) proposed an Expectation–Maximization (EM) algorithm to estimate the semi-Markov kernel and the emission probabilities that characterize the dynamics of the model. In this article, we consider an improved version of Barbu and Limnios' EM algorithm which is faster than the original one. Moreover, we propose a stochastic version of the EM algorithm that achieves estimates comparable to those of the EM algorithm in less execution time. Some numerical examples are provided which illustrate the efficient performance of the proposed algorithms.

9.
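Clustering based on partitioning by data-distribution density is one of the main families of clustering algorithms in data mining. To address the shortcomings of traditional density-partitioning clustering algorithms, such as complex computation and low running efficiency, a multi-partition clustering algorithm based on stepwise high-dimensional projection is designed. Using the projection density of the high-dimensional distribution as the criterion, the dataset is partitioned multiple times to produce sub-cluster spaces, and the sub-clusters are then merged to form the desired clustering result. Experiments with the algorithm show that it is computationally simple and runs efficiently.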

10.
Cluster analysis is a popular statistics and computer science technique commonly used in various areas of research. In this article, we investigate factors that can influence clustering performance in the model-based clustering framework. The four factors considered are the level of overlap, number of clusters, number of dimensions, and sample size. Through a comprehensive simulation study, we investigate model-based clustering in different settings. As a measure of clustering performance, we employ three popular classification indices capable of reflecting the degree of agreement in two partitioning vectors, thus making the comparison between the true and estimated classification vectors possible. In addition to studying clustering complexity, the performance of the three classification measures is evaluated.
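
The abstract does not name its three classification indices. As an illustration of what an agreement index between a true and an estimated partition vector looks like, here is a minimal sketch of the Rand index, chosen here as an assumption and not necessarily one of the three used in the article. The function name rand_index is hypothetical.

```python
from itertools import combinations

def rand_index(labels_true, labels_est):
    """Fraction of observation pairs on which two partitions agree.

    A pair agrees when both partitions put it in the same cluster,
    or both put it in different clusters. Requires at least 2 observations.
    (Hypothetical helper; the Rand index is an assumed example index.)
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_est[i] == labels_est[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

Because only co-membership of pairs is compared, the index is unaffected by relabeling the clusters, which is what makes a comparison between true and estimated classification vectors meaningful.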

11.
The accelerated failure time (AFT) model is an important alternative to the Cox proportional hazards model (PHM) in survival analysis. For multivariate failure time data we propose to use frailties to explicitly account for possible correlations (and heterogeneity) among failure times. An EM-like algorithm analogous to that in the frailty model for the Cox model is adapted. Through simulation it is shown that its performance compares favorably with that of the marginal independence approach. For illustration we reanalyze a real dataset.

12.
We consider the change-point problem in a classical framework while assuming a probability distribution for the change-point. An EM algorithm is proposed to estimate the distribution of the change-point. A change-point model for multiple profiles is also proposed, and an EM algorithm is presented to estimate the model. Two examples, Illinois traffic data and the Dow Jones Industrial Average, are used to demonstrate the proposed methods.

13.
In this article, a general approach to latent variable models based on an underlying generalized linear model (GLM) with factor analysis observation process is introduced. We call these models Generalized Linear Factor Models (GLFM). The observations are produced from a general model framework that involves observed and latent variables that are assumed to be distributed in the exponential family. More specifically, we concentrate on situations where the observed variables are both discretely measured (e.g., binomial, Poisson) and continuously distributed (e.g., gamma). The common latent factors are assumed to be independent with a standard multivariate normal distribution. Practical details of training such models with a new local expectation-maximization (EM) algorithm, which can be considered as a generalized EM-type algorithm, are also discussed. In conjunction with an approximated version of the Fisher score algorithm (FSA), we show how to calculate maximum likelihood estimates of the model parameters, and to yield inferences about the unobservable path of the common factors. The methodology is illustrated by an extensive Monte Carlo simulation study and the results show promising performance.

14.
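Traditional hierarchical models assume that groups are independent of one another and do not account for correlation between groups. Data grouped by geographic unit, however, often exhibit spatial dependence: individuals are affected not only by their own region but possibly also by neighboring regions. In that case, the traditional hierarchical model's assumption about the distribution of the level-2 residuals no longer holds. To handle spatially hierarchical data, ideas from spatial statistics and spatial econometric models are introduced into the hierarchical model, retaining the hierarchical structure while accounting for spatial correlation. A spatial hierarchical linear model is proposed, and maximum likelihood estimates of its fixed effects, variance-covariance components, and spatial regression parameters are given; the Fisher scoring algorithm is used in combination with the EM algorithm.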

15.
We propose a method for estimating parameters in generalized linear models when the outcome variable is missing for some subjects and the missing data mechanism is non-ignorable. We assume throughout that the covariates are fully observed. One possible method for estimating the parameters is maximum likelihood with a non-ignorable missing data model. However, caution must be used when fitting non-ignorable missing data models because certain parameters may be inestimable for some models. Instead of fitting a non-ignorable model, we propose the use of auxiliary information in a likelihood approach to reduce the bias, without having to specify a non-ignorable model. The method is applied to a mental health study.

16.
The mixed-Weibull distribution has been used to model a wide range of failure data sets, and in many practical situations the number of components in a mixture model is unknown. Thus, the parameter estimation of a mixed-Weibull distribution is considered and the important issue of how to determine the number of components is discussed. Two approaches are proposed to solve this problem. One is the method of moments and the other is a regularization type of fuzzy clustering algorithm. Finally, numerical examples and two real data sets are given to illustrate the features of the proposed approaches.

17.
We consider bivariate current status data with death, which often occur in animal tumorigenicity experiments. Instead of the exact tumor onset time being observed, only the existence of a tumor is known at death time or sacrifice time. Such an incomplete data structure makes it difficult to investigate the effect of treatment on tumor onset times. Furthermore, when tumor onsets occur at two sites, the order of their onsets is unknown. A multistate model is applied to incorporate the sequential occurrence of events. For the inference of parameters, an EM algorithm is applied and a real NTP (National Toxicology Program) dataset is analyzed as an illustrative example.

18.
In this paper, we present a new algorithm for clustering a proximity-relation matrix that does not require the transitivity property. The proposed algorithm is first inspired by the idea of Yang and Wu [16] and then turned into a self-organizing process built upon the intuition behind clustering. At the end of the process, subjects belonging to the same cluster should converge to the same point, which represents the cluster center. However, the performance of Yang and Wu's algorithm depends on parameter selection. In this paper, we use the partition entropy (PE) index to choose the parameter. Numerical results illustrate that the proposed method not only solves the parameter selection problem but also obtains an optimal clustering result. Finally, we apply the proposed algorithm to three applications: evaluating the performance of higher education in Taiwan, machine–parts grouping in cellular manufacturing systems, and clustering probability density functions.
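
The abstract does not define the PE index. Assuming it refers to the commonly used partition entropy of a membership matrix, PE = -(1/n) * sum over i and k of u_ik * ln(u_ik), a minimal sketch follows. The function name partition_entropy and the row-per-subject layout of the membership matrix are assumptions, and how the article derives memberships from its self-organizing process is not stated.

```python
import math

def partition_entropy(memberships):
    """Partition entropy of an n-by-c membership matrix.

    memberships[i][k] is the degree to which subject i belongs to
    cluster k; each row is assumed to sum to 1.
    (Hypothetical helper; the formula is the assumed PE definition.)
    """
    n = len(memberships)
    total = 0.0
    for row in memberships:
        for u in row:
            if u > 0:  # the term u * ln(u) is taken as 0 when u == 0
                total -= u * math.log(u)
    return total / n
```

Under this definition PE is 0 for a crisp partition and reaches ln(c) when every subject is spread evenly over c clusters, so a parameter value giving a smaller PE would correspond to a sharper partition.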

19.
This article focuses on data analyses under the scenario of missing at random (MAR) within discrete-time Markov chain models. The naive method, nonlinear (NL) method, and Expectation-Maximization (EM) algorithm are discussed. We extend the NL method into a Bayesian framework, using an adjusted rejection algorithm to sample the posterior distribution, and estimating the transition probabilities with a Monte Carlo algorithm. We compare the Bayesian nonlinear (BNL) method with the naive method and the EM algorithm under various missing rates, and comprehensively evaluate estimators in terms of biases, variances, mean square errors, and coverage probabilities (CPs). Our simulation results show that the EM algorithm usually offers the smallest variances but the poorest CP, while the BNL method has smaller variances and better or similar CP compared with the naive method. When the missing rate is low (about 9%, MAR), the three methods are comparable. When the missing rate is high (about 25%, MAR), the BNL method overall performs slightly but consistently better than the naive method in terms of variances and CP. Data from a longitudinal study of stress levels among caregivers of individuals with Alzheimer’s disease are used to illustrate these methods.
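
The abstract does not spell out its naive method. One simple estimator of that kind, shown here purely as an assumption, counts transitions between consecutive observed states and skips any pair in which either state is missing. The function name transition_estimates, the use of None for a missing observation, and the integer state coding are all hypothetical.

```python
def transition_estimates(sequences, n_states):
    """Estimate a transition matrix from sequences with missing values.

    sequences : list of state sequences; states are ints 0..n_states-1,
                and None marks a missing observation
    Only consecutive pairs with both states observed are counted.
    (Hypothetical sketch; not necessarily the article's naive method.)
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            if prev is None or curr is None:
                continue  # drop transitions that touch a missing value
            counts[prev][curr] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Rows with no observed transitions are left as zeros.
        probs.append([c / total if total else 0.0 for c in row])
    return probs
```

Every missing observation discards two transitions, so the usable sample shrinks as the missing rate grows, which is one plausible reason the compared methods diverge at the higher missing rate.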

20.
We propose a latent variable model for informative missingness in longitudinal studies which is an extension of the latent dropout class model. In our model, the value of the latent variable is affected by the missingness pattern and it is also used as a covariate in modeling the longitudinal response. The latent variable thus links the longitudinal response and the missingness process. In our model, the latent variable is continuous instead of categorical and we assume that it follows a normal distribution. The EM algorithm is used to obtain estimates of the parameters of interest, and Gauss–Hermite quadrature is used to approximate the integration over the latent variable. The standard errors of the parameter estimates can be obtained from the bootstrap method or from the inverse of the Fisher information matrix of the final marginal likelihood. Comparisons are made to the mixed model and complete-case analysis using a clinical trial dataset, the Weight Gain Prevention among Women (WGPW) study. We use the generalized Pearson residuals to assess the fit of the proposed latent variable model.
