Label switching is one of the fundamental issues in Bayesian mixture modeling. It occurs because the components are nonidentifiable under symmetric priors. Unless label switching is resolved, the ergodic averages of component-specific quantities will be identical and thus useless for inference relating to individual components, such as the posterior means, predictive component densities, and marginal classification probabilities. The author establishes the equivalence between labeling and clustering and proposes two simple clustering criteria to solve the label switching problem. The first method can be considered an extension of K-means clustering. The second method finds the labels by minimizing the volume of the labeled samples, and it is invariant to scale transformations of the parameters. Using a simulation example and applications to two real data sets, the author demonstrates the success of these new methods in dealing with the label switching problem.
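
The K-means-style idea in this abstract can be illustrated with a minimal sketch. It is not the author's exact criterion: it assumes scalar component parameters, a squared-distance loss, and an exhaustive search over all m! permutations, and the function name `relabel_kmeans_style` is hypothetical.

```python
from itertools import permutations


def relabel_kmeans_style(draws, max_iter=50):
    """Relabel MCMC draws so each component stays close to its running center.

    draws: list of tuples, one per MCMC iteration, each holding one scalar
    parameter (for example a mean) per mixture component.
    """
    m = len(draws[0])
    perms = list(permutations(range(m)))
    centers = list(draws[0])  # assumption: the first draw fixes the initial labeling
    labeled = None
    for _ in range(max_iter):
        # Assignment step: for each draw, pick the permutation closest to the centers.
        new_labeled = []
        for d in draws:
            best = min(
                perms,
                key=lambda p: sum((d[p[j]] - centers[j]) ** 2 for j in range(m)),
            )
            new_labeled.append(tuple(d[j] for j in best))
        if new_labeled == labeled:
            break  # the labels stopped changing
        labeled = new_labeled
        # Update step: recompute each component's center from the labeled draws.
        centers = [sum(s[j] for s in labeled) / len(labeled) for j in range(m)]
    return labeled
```

The alternation between an assignment step and a center update is what makes the procedure resemble K-means; the exhaustive permutation search is only practical for a small number of components.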

Effectively solving the label switching problem is critical for both Bayesian and frequentist mixture model analyses. In this article, a new relabeling method is proposed by extending a recently developed modal clustering algorithm. First, the posterior distribution is estimated by a kernel density estimate (KDE) from permuted MCMC or bootstrap samples of the parameters. Second, a modal EM algorithm is used to find the m! symmetric modes of the KDE. Finally, samples that ascend to the same mode are assigned the same label. Simulations and real data applications demonstrate that the new method provides more accurate estimates than many existing relabeling methods.

Empirical likelihood ratio confidence regions based on the chi-square calibration suffer from an undercoverage problem in that their actual coverage levels tend to be lower than the nominal levels. The finite sample distribution of the empirical log-likelihood ratio is recognized to have a mixture structure with a continuous component on [0, +∞) and a point mass at +∞. The undercoverage problem of the chi-square calibration is partly due to its use of the continuous chi-square distribution to approximate the mixture distribution of the empirical log-likelihood ratio. In this article, we propose two new methods of calibration that take advantage of the mixture structure; we construct two new mixture distributions by using the F and chi-square distributions and use these to approximate the mixture distribution of the empirical log-likelihood ratio. The new methods of calibration are asymptotically equivalent to the chi-square calibration. However, the new methods, in particular the F mixture based method, can be substantially more accurate than the chi-square calibration for small and moderately large sample sizes. The new methods are also as easy to use as the chi-square calibration.

Model based labeling for mixture models   (Cited by: 1; self-citations: 0; citations by others: 1)
Label switching is one of the fundamental problems for Bayesian mixture model analysis. Due to the permutation invariance of the mixture posterior, the posterior of an m-component mixture model can be regarded as a mixture distribution with m! symmetric components, and therefore the objective of labeling is to recover one of these components. To perform labeling, we propose to first fit a symmetric m!-component mixture model to the Markov chain Monte Carlo (MCMC) samples and then choose the label for each sample by maximizing the corresponding classification probabilities, which are the probabilities of all possible labels for each sample. Both parametric and semi-parametric ways are proposed to fit the symmetric mixture model for the posterior. Compared to the existing labeling methods, our proposed method aims to approximate the posterior directly and provides the labeling probabilities for all possible labels, and thus has a model explanation and theoretical support. In addition, we introduce a situation in which the “ideally” labeled samples are available and thus can be used to compare different labeling methods. We demonstrate the success of our new method in dealing with the label switching problem using two examples.

Linear mixed models are widely used when multiple correlated measurements are made on each unit of interest. In many applications, the units may form several distinct clusters, and such heterogeneity can be more appropriately modelled by a finite mixture linear mixed model. The classical estimation approach, in which both the random effects and the error parts are assumed to follow normal distributions, is sensitive to outliers, and failure to accommodate outliers may greatly jeopardize the model estimation and inference. We propose a new mixture linear mixed model using the multivariate t distribution. For each mixture component, we assume the response and the random effects jointly follow a multivariate t distribution, to conveniently robustify the estimation procedure. An efficient expectation conditional maximization algorithm is developed for conducting maximum likelihood estimation. The degrees of freedom parameters of the t distributions are chosen data adaptively, to achieve a flexible trade-off between estimation robustness and efficiency. Simulation studies and an application to the analysis of lung growth longitudinal data showcase the efficacy of the proposed approach.

Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the heavier-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, it provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.

In modelling financial return time series and time-varying volatility, the Gaussian and the Student-t distributions are widely used in stochastic volatility (SV) models. However, other distributions such as the Laplace distribution and generalized error distribution (GED) are also common in SV modelling. Therefore, this paper proposes the use of the generalized t (GT) distribution, whose special cases are the Gaussian distribution, Student-t distribution, Laplace distribution and GED. Since the GT distribution is a member of the scale mixture of uniform (SMU) family of distributions, we handle the GT distribution via its SMU representation. We show that this SMU form can substantially simplify the Gibbs sampler for Bayesian simulation-based computation and can provide a means of identifying outliers. In an empirical study, we adopt a GT–SV model to fit the daily return of the exchange rate of the Australian dollar to three other currencies and use the exchange rate to the US dollar as a covariate. Model implementation relies on Bayesian Markov chain Monte Carlo algorithms using the WinBUGS package.

This paper presents a new Bayesian, infinite mixture model based, clustering approach, specifically designed for time-course microarray data. The problem is to group together genes which have “similar” expression profiles, given the set of noisy measurements of their expression levels over a specific time interval. In order to capture temporal variations of each curve, a non-parametric regression approach is used. Each expression profile is expanded over a set of basis functions and the sets of coefficients of each curve are subsequently modeled through a Bayesian infinite mixture of Gaussian distributions. The task of finding clusters of genes with similar expression profiles is thereby reduced to the problem of grouping together genes whose coefficients are sampled from the same distribution in the mixture. A Dirichlet process prior is naturally employed in such models, since it allows one to deal automatically with the uncertainty about the number of clusters. The posterior inference is carried out by a split and merge MCMC sampling scheme which integrates out the parameters of the component distributions and updates only the latent vector of cluster memberships. The final configuration is obtained via the maximum a posteriori estimator. The performance of the method is studied using synthetic and real microarray data and is compared with the performance of competing techniques.

Robust mixture modelling using the t distribution   (Cited by: 2; self-citations: 0; citations by others: 2)
Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer-than-normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
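
A small sketch may help show why t components are more robust than normal ones. It assumes the standard scale-mixture representation of the t distribution, a single univariate component, and a fixed degrees-of-freedom value; the abstract does not spell out these details, and the function names `t_weight` and `weighted_mean` are hypothetical.

```python
def t_weight(x, mu, sigma2, nu):
    """Conditional expected scale weight of x under a univariate t component.

    Under the scale-mixture representation, the weight is
    (nu + 1) / (nu + delta), where delta is the squared standardized distance.
    """
    delta = (x - mu) ** 2 / sigma2
    return (nu + 1.0) / (nu + delta)


def weighted_mean(xs, mu, sigma2, nu):
    """One location update for a single t component (all responsibilities equal to 1)."""
    ws = [t_weight(x, mu, sigma2, nu) for x in xs]
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)
```

Observations far from the current location receive small weights, so an atypical point pulls the updated location far less than it would pull an ordinary sample mean.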

Solving label switching is crucial for interpreting the results of fitting Bayesian mixture models. Label switching originates from the invariance of the posterior distribution to permutation of the component labels. As a result, the component labels in a Markov chain simulation may switch to another equivalent permutation, and the marginal posterior distributions associated with all labels may be similar and useless for inferring quantities relating to each individual component. In this article, we propose a new simple labelling method that minimizes the deviance of the class probabilities from a fixed set of reference labels. The reference labels can be chosen before running Markov chain Monte Carlo (MCMC) using optimization methods, such as expectation-maximization algorithms, and therefore the new labelling method can be implemented as an online algorithm, which can reduce the storage requirements and save much computation time. Using the Acid data set and the Galaxy data set, we demonstrate the success of the proposed labelling method in removing label switching from the raw MCMC samples.
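
One way to read the reference-label idea is sketched below. It is an interpretation under stated assumptions, not the authors' algorithm: the deviance is taken to be minus twice the log classification probability of the reference labels, all m! permutations are searched exhaustively, and the function name `relabel_to_reference` is hypothetical.

```python
import math
from itertools import permutations


def relabel_to_reference(class_probs, ref_labels):
    """Choose the permutation of component labels for one MCMC draw.

    class_probs: n rows, each with m classification probabilities from the draw.
    ref_labels: n fixed reference labels in range(m), chosen before sampling.
    Returns perm such that relabeled component j is the draw's component perm[j].
    """
    m = len(class_probs[0])

    def deviance(perm):
        # Assumed form: -2 * sum of log probabilities assigned to the reference labels.
        # Probabilities are floored to avoid log(0).
        return -2.0 * sum(
            math.log(max(row[perm[z]], 1e-300))
            for row, z in zip(class_probs, ref_labels)
        )

    return min(permutations(range(m)), key=deviance)
```

Because each draw is relabeled using only that draw and the fixed reference labels, the step can run inside the sampler and the raw, unlabeled chain does not need to be stored.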

A new Markov chain Monte Carlo method for the Bayesian analysis of finite mixture distributions with an unknown number of components is presented. The sampler is characterized by a state space consisting only of the number of components and the latent allocation variables. Its main advantage is that it can be used, with minimal changes, for mixtures of components from any parametric family, under the assumption that the component parameters can be integrated out of the model analytically. Artificial and real data sets are used to illustrate the method and mixtures of univariate and of multivariate normals are explicitly considered. The problem of label switching, when parameter inference is of interest, is addressed in a post-processing stage.

Partially linear models (PLMs) are an important tool in modelling economic and biometric data and are considered a flexible generalization of the linear model, obtained by including a nonparametric component of some covariate in the linear predictor. Usually, the error component is assumed to follow a normal distribution. However, theory and application (through simulation or experimentation) often generate a large number of data sets that are skewed. The objective of this paper is to extend PLMs by allowing the errors to follow a skew-normal distribution [A. Azzalini, A class of distributions which includes the normal ones, Scand. J. Statist. 12 (1985), pp. 171–178], increasing the flexibility of the model. In particular, we develop the expectation-maximization (EM) algorithm for linear regression models and diagnostic analysis via local influence as well as generalized leverage, following [H. Zhu and S. Lee, Local influence for incomplete-data models, J. R. Stat. Soc. Ser. B 63 (2001), pp. 111–126]. A simulation study is also conducted to evaluate the efficiency of the EM algorithm. Finally, a suitable transformation is applied to a data set on ragweed pollen concentration in order to fit PLMs under asymmetric distributions. An illustrative comparison is performed between normal and skew-normal errors.

This paper deals with the problem of maximum likelihood estimation for a mixture of skew Student-t-normal distributions, which is a novel model-based tool for clustering heterogeneous (multiple groups) data in the presence of skewed and heavy-tailed outcomes. We present two analytically simple EM-type algorithms for iteratively computing the maximum likelihood estimates. The observed information matrix is derived for obtaining the asymptotic standard errors of parameter estimates. A small simulation study is conducted to demonstrate the superiority of the skew Student-t-normal distribution compared to the skew t distribution. The proposed methodology is particularly useful for analyzing multimodal asymmetric data as produced by major biotechnological platforms like flow cytometry. We provide such an application with the help of an illustrative example.

We revisit the problem of estimating the proportion π of true null hypotheses when a large number of parallel hypothesis tests are performed independently. While the proportion is a quantity of interest in its own right in applications, the problem has arisen in assessing or controlling an overall false discovery rate. On the basis of a Bayes interpretation of the problem, the marginal distribution of the p-value is modeled as a mixture of the uniform distribution (null) and a non-uniform distribution (alternative), so that the parameter π of interest is characterized as the mixing proportion of the uniform component in the mixture. In this article, a nonparametric exponential mixture model is proposed to fit the p-values. As an alternative approach to the convex decreasing mixture model, the exponential mixture model has the advantages of identifiability, flexibility, and regularity. A computation algorithm is developed. The new approach is applied to a leukemia gene expression data set where multiple significance tests over 3,051 genes are performed. The new estimate for π with the leukemia gene expression data appears to be about 10% lower than the other three estimates, which are known to be conservative. Simulation results also show that the new estimate is usually lower and has smaller bias than the other three estimates.
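
The uniform-plus-alternative mixture view of p-values can be made concrete with a short sketch. It does not implement the exponential mixture estimator proposed in the abstract; it swaps in the simpler, well-known tail-proportion estimator (often attributed to Storey), which counts p-values above a threshold λ where the uniform null component dominates. The function name `tail_proportion_pi0` and the default λ = 0.5 are assumptions made for illustration.

```python
def tail_proportion_pi0(pvalues, lam=0.5):
    """Estimate the proportion of true nulls from the p-values above lam.

    Under the mixture model, null p-values are uniform on (0, 1), so about
    pi * (1 - lam) of all p-values exceed lam if alternative p-values are
    rarely that large. Solving for pi gives the estimator below.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    n = len(pvalues)
    n_above = sum(1 for p in pvalues if p > lam)
    estimate = n_above / (n * (1.0 - lam))
    return min(estimate, 1.0)  # a proportion cannot exceed 1
```

Because some alternative p-values also fall above λ, this kind of estimator tends to overestimate π, which is the sense in which such estimates are called conservative.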

A finite mixture model using the Student's t distribution has been recognized as a robust extension of normal mixtures. Recently, a mixture of skew normal distributions has been found to be effective in the treatment of heterogeneous data involving asymmetric behaviors across subclasses. In this article, we propose a robust mixture framework based on the skew t distribution to efficiently deal with heavy-tailedness, extra skewness and multimodality in a wide range of settings. Statistical mixture modeling based on normal, Student's t and skew normal distributions can be viewed as special cases of the skew t mixture model. We present analytically simple EM-type algorithms for iteratively computing maximum likelihood estimates. The proposed methodology is illustrated by analyzing a real data example.

It is generally assumed that, under some regularity conditions, the likelihood ratio statistic for testing the null hypothesis that data arise from a homoscedastic normal mixture distribution versus the alternative hypothesis that data arise from a heteroscedastic normal mixture distribution has an asymptotic χ² reference distribution with degrees of freedom equal to the difference in the number of parameters being estimated under the alternative and null models. Simulations show that the χ² reference distribution will give a reasonable approximation for the likelihood ratio test only when the sample size is 2000 or more and the mixture components are well separated, when the restrictions suggested by Hathaway (Ann. Stat. 13:795–800, 1985) are imposed on the component variances to ensure that the likelihood is bounded under the alternative distribution. For small and medium sample sizes, parametric bootstrap tests appear to work well for determining whether data arise from a normal mixture with equal variances or a normal mixture with unequal variances.

The authors consider hidden Markov models (HMMs) whose latent process has m ≥ 2 states and whose state‐dependent distributions arise from a general one‐parameter family. They propose a test of the hypothesis m = 2. Their procedure is an extension to HMMs of the modified likelihood ratio statistic proposed by Chen, Chen & Kalbfleisch (2004) for testing two states in a finite mixture. The authors determine the asymptotic distribution of their test under the hypothesis m = 2 and investigate its finite‐sample properties in a simulation study. Their test is based on inference for the marginal mixture distribution of the HMM. In order to illustrate the additional difficulties due to the dependence structure of the HMM, they show how to test general regular hypotheses on the marginal mixture of HMMs via a quasi‐modified likelihood ratio. They also discuss two applications.

In a Bayesian analysis of finite mixture models, parameter estimation and clustering are sometimes less straightforward than might be expected. In particular, the common practice of estimating parameters by their posterior mean, and summarizing joint posterior distributions by marginal distributions, often leads to nonsensical answers. This is due to the so-called 'label switching' problem, which is caused by symmetry in the likelihood of the model parameters. A frequent response to this problem is to remove the symmetry by using artificial identifiability constraints. We demonstrate that this fails in general to solve the problem, and we describe an alternative class of approaches, relabelling algorithms, which arise from attempting to minimize the posterior expected loss under a class of loss functions. We describe in detail one particularly simple and general relabelling algorithm and illustrate its success in dealing with the label switching problem on two examples.

Extended Weibull type distribution and finite mixture of distributions   (Cited by: 1; self-citations: 0; citations by others: 1)
An extended form of the Weibull distribution is suggested which has two shape parameters (m and δ). Introducing the additional shape parameter δ not only allows the extended Weibull distribution to be expressed as an exact form of a mixture of distributions under certain conditions, but also gives the density function extra flexibility over the positive range. The shape of the density function of the extended Weibull type distribution is shown for various values of the parameters, which may be of some interest to Bayesians. Certain statistical properties, such as the hazard rate function, mean residual function and rth moment, are defined explicitly. The proposed extended Weibull distribution is used to derive an exact form of two-, three- and k-component mixtures of distributions. With the help of a real data set, the usefulness of the mixture Weibull type distribution is illustrated using a Markov chain Monte Carlo (MCMC) Gibbs sampling approach.

The normal and Laplace distributions are the two earliest known continuous distributions in statistics and the two most popular models for analyzing symmetric data. In this note, the exact distribution of the ratio |X/Y| is derived when X and Y are, respectively, normal and Laplace random variables distributed independently of each other. A MAPLE program is provided for computing the associated percentage points. An application of the derived distribution to a discriminant problem is provided.
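
Percentage points of |X/Y| can also be approximated by simulation, which gives a rough check on any exact computation. The sketch below is not the MAPLE program mentioned in the abstract; it assumes a standard normal X and a standard Laplace Y (location 0, scale 1), and the function names `simulate_abs_ratio` and `empirical_quantile` are hypothetical.

```python
import random


def simulate_abs_ratio(n, seed=0):
    """Draw n values of |X/Y| with X ~ N(0, 1) and Y ~ Laplace(0, 1), sorted ascending."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        x = rng.gauss(0.0, 1.0)
        # The difference of two independent Exp(1) variables is standard Laplace.
        y = rng.expovariate(1.0) - rng.expovariate(1.0)
        if y == 0.0:
            continue  # skip the (practically impossible) exact zero denominator
        values.append(abs(x / y))
    values.sort()
    return values


def empirical_quantile(sorted_values, q):
    """Return a simple empirical q-quantile from an ascending list."""
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]
```

For example, `empirical_quantile(simulate_abs_ratio(100000), 0.95)` gives a Monte Carlo approximation to the upper 5% point under the assumed parameters; the ratio is heavy-tailed, so extreme percentage points need large sample sizes to stabilize.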
