Similar literature
20 similar documents found (search time: 31 ms)
1.
We use minimum message length (MML) estimation for mixture modelling. MML estimates are derived to choose the number of components in the mixture model to best describe the data and to estimate the parameters of the component densities for Gaussian mixture models. An empirical comparison of criteria prominent in the literature for estimating the number of components in a data set is performed. We have found that MML coding considerations allow the derivation of useful results to guide our implementation of a mixture modelling program. These advantages allow model search to be controlled based on the minimum variance for a component and the amount of data required to distinguish two overlapping components.
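As a rough illustration of the idea (not the paper's actual MML derivation), the sketch below fits 1-D Gaussian mixtures by EM and scores each candidate number of components with a two-part code length, using the asymptotic 0.5·p·log n parameter cost as a crude stand-in for the exact MML term; all names and constants are illustrative.

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return (weights, means, variances, loglik)."""
    n = len(x)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)   # E-step: responsibilities
        nk = r.sum(axis=0) + 1e-12                   # M-step (small ridge avoids empty components)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return w, mu, var, np.log(dens.sum(axis=1)).sum()

def code_length(x, k):
    """Two-part message length: data code (-loglik) plus a parameter code, here the
    0.5 * p * log(n) asymptotic penalty as a crude stand-in for the exact MML term."""
    _, _, _, ll = em_gmm_1d(x, k)
    p = (k - 1) + k + k                 # free mixing weights, means, variances
    return -ll + 0.5 * p * np.log(len(x))

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
best_k = min(range(1, 5), key=lambda k: code_length(x, k))
```

For two well-separated components, the shorter message length correctly favours k = 2 over k = 1.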

2.
We developed a flexible non-parametric Bayesian model for regional disease-prevalence estimation based on cross-sectional data that are obtained from several subpopulations or clusters such as villages, cities, or herds. The subpopulation prevalences are modeled with a mixture distribution that allows for zero prevalence. The distribution of prevalences among diseased subpopulations is modeled as a mixture of finite Polya trees. Inferences can be obtained for (1) the proportion of diseased subpopulations in a region, (2) the distribution of regional prevalences, (3) the mean and median prevalence in the region, (4) the prevalence of any sampled subpopulation, and (5) predictive distributions of prevalences for regional subpopulations not included in the study, including the predictive probability of zero prevalence. We focus on prevalence estimation using data from a single diagnostic test, but we also briefly discuss the scenario where two conditionally dependent (or independent) diagnostic tests are used. Simulated data demonstrate the utility of our non-parametric model over parametric analysis. An example involving brucellosis in cattle is presented.

3.
Summary. Suppose that we have m repeated measures on each subject, and we model the observation vectors with a finite mixture model. We further assume that the repeated measures are conditionally independent. We present methods to estimate the shape of the component distributions along with various features of the component distributions such as the medians, means and variances. We make no distributional assumptions on the components; indeed, we allow different shapes for different components.

4.
A review of the randomized response model introduced by Warner (1965) is given; then a randomized response model applicable to continuous data, based on a mixture of two normal distributions, is considered. The target here is not to estimate any parameter, but rather to select the population with the best parameter value. This article provides a study on how to choose the best population among k distinct populations using an indifference-zone procedure. Also, this article includes tables for the required sample size needed in order to have a probability of correct selection higher than some specified value in the preference zone for the randomized response model considered.
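A minimal Monte Carlo sketch of the indifference-zone idea for normal populations (a simplification of the randomized-response setting in the article; the function and its parameters are hypothetical):

```python
import numpy as np

def prob_correct_selection(k, delta, n, reps=4000, seed=0):
    """Monte Carlo estimate of P(correct selection) when the best of k normal
    populations exceeds the rest by delta (the least-favourable indifference-zone
    configuration) and selection picks the largest sample mean from n observations each."""
    rng = np.random.default_rng(seed)
    means = np.zeros(k)
    means[-1] = delta                                  # best population beats the rest by exactly delta
    sample_means = rng.normal(means, 1.0 / np.sqrt(n), size=(reps, k))
    return float(np.mean(sample_means.argmax(axis=1) == k - 1))

# Larger n pushes the probability of correct selection toward 1 in the preference zone,
# which is what the article's sample-size tables formalize.
pcs_small = prob_correct_selection(k=3, delta=0.5, n=10)
pcs_large = prob_correct_selection(k=3, delta=0.5, n=100)
```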

5.
Many studies have been made of the performance of standard algorithms used to estimate the parameters of a mixture density, where data arise from two or more underlying populations. While these studies examine uncensored data, many mixture processes are right-censored. Therefore, this paper addresses the accuracy and efficiency of standard and hybrid algorithms under different degrees of right-censored data. While a common belief is that the EM algorithm is slow and inaccurate, we find that the EM generally exhibits excellent efficiency and accuracy. While extreme right censoring causes the EM to frequently fail to converge, a hybrid-EM algorithm is found to be superior at all levels of right-censoring.

6.
The latent class model or multivariate multinomial mixture is a powerful approach for clustering categorical data. It uses a conditional independence assumption given the latent class to which a statistical unit belongs. In this paper, we exploit the fact that a fully Bayesian analysis with Jeffreys non-informative prior distributions raises no technical difficulty, and propose an exact expression of the integrated complete-data likelihood, which is known to be a meaningful model selection criterion from a clustering perspective. Similarly, a Monte Carlo approximation of the integrated observed-data likelihood can be obtained in two steps: an exact integration over the parameters is followed by an approximation of the sum over all possible partitions through an importance sampling strategy. Then, the exact and the approximate criteria experimentally compete, respectively, with their standard asymptotic BIC approximations for choosing the number of mixture components. Numerical experiments on simulated data and a biological example highlight that asymptotic criteria are usually dramatically more conservative than the non-asymptotic criteria presented here, not only for moderate sample sizes as expected but also for quite large sample sizes. This research highlights that asymptotic standard criteria could often fail to select some interesting structures present in the data.
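For binary variables, the exact integration over parameters under Jeffreys priors reduces to Dirichlet-multinomial and Beta-binomial terms. A small sketch under that assumption (illustrative names and data; the paper's general multinomial case is analogous):

```python
from math import lgamma

def log_beta(a, b):
    """Log of the Beta function via lgamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_icl_exact(x, z, g):
    """Exact integrated complete-data likelihood for a latent class model with
    binary variables, Jeffreys Dirichlet(1/2,...,1/2) prior on the mixing
    proportions and Beta(1/2, 1/2) priors on the class-wise Bernoulli parameters.
    x: list of 0/1 tuples; z: class label (0..g-1) per row."""
    n, d = len(x), len(x[0])
    nk = [sum(1 for zi in z if zi == k) for k in range(g)]
    # integrate out the mixing proportions (Dirichlet-multinomial term)
    out = lgamma(g / 2.0) - lgamma(n + g / 2.0)
    out += sum(lgamma(m + 0.5) - lgamma(0.5) for m in nk)
    # integrate out each class-by-variable Bernoulli parameter (Beta-binomial term)
    for k in range(g):
        for j in range(d):
            s = sum(xi[j] for xi, zi in zip(x, z) if zi == k)
            out += log_beta(s + 0.5, nk[k] - s + 0.5) - log_beta(0.5, 0.5)
    return out

x = [(1, 1), (1, 1), (0, 0), (0, 0)]
good = log_icl_exact(x, [0, 0, 1, 1], g=2)   # partition matching the response patterns
bad = log_icl_exact(x, [0, 1, 0, 1], g=2)    # partition mixing the patterns
```

The coherent partition receives the higher integrated complete-data likelihood, which is how the criterion ranks clusterings.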

7.
This paper introduces a new four-parameter lifetime model called the Weibull Burr XII distribution. The new model has the advantage of being capable of modeling various shapes of aging and failure criteria. We derive some of its structural properties including ordinary and incomplete moments, quantile and generating functions, probability weighted moments, and order statistics. The new density function can be expressed as a linear mixture of Burr XII densities. We propose a log-linear regression model based on the new distribution, the so-called log-Weibull Burr XII distribution. The maximum likelihood method is used to estimate the model parameters. Simulation results to assess the performance of the maximum likelihood estimation are discussed. We demonstrate empirically the importance and flexibility of the new model in modeling various types of data.
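The Burr XII building block has the closed-form CDF F(x) = 1 - (1 + x^c)^(-k), so inverse-transform sampling is immediate. A small sketch (standard two-parameter Burr XII only; the four-parameter Weibull Burr XII quantile function derived in the paper is more involved):

```python
import random

def burr12_cdf(x, c, k):
    """CDF of the standard Burr XII distribution: F(x) = 1 - (1 + x**c)**(-k), x > 0."""
    return 1.0 - (1.0 + x ** c) ** (-k)

def burr12_quantile(u, c, k):
    """Inverse of the Burr XII CDF, used for inverse-transform sampling."""
    return ((1.0 - u) ** (-1.0 / k) - 1.0) ** (1.0 / c)

def burr12_sample(n, c, k, seed=0):
    """Draw n values by pushing uniforms through the quantile function."""
    rng = random.Random(seed)
    return [burr12_quantile(rng.random(), c, k) for _ in range(n)]

median = burr12_quantile(0.5, c=2.0, k=3.0)
```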

8.
I review the use of auxiliary variables in capture-recapture models for estimation of demographic parameters (e.g. capture probability, population size, survival probability, and recruitment, emigration and immigration numbers). I focus on what has been done in current research and what still needs to be done. Typically in the literature, covariate modelling has made capture and survival probabilities functions of covariates, but there are good reasons also to make other parameters functions of covariates as well. The types of covariates considered include environmental covariates that may vary by occasion but are constant over animals, and individual animal covariates that are usually assumed constant over time. I also discuss the difficulties of using time-dependent individual animal covariates and some possible solutions. Covariates are usually assumed to be measured without error, and that may not be realistic. For closed populations, one approach to modelling heterogeneity in capture probabilities uses observable individual covariates and is thus related to the primary purpose of this paper. The now standard Huggins-Alho approach conditions on the captured animals and then uses a generalized Horvitz-Thompson estimator to estimate population size. This approach has the advantage of simplicity in that one does not have to specify a distribution for the covariates, and the disadvantage is that it does not use the full likelihood to estimate population size. Alternately one could specify a distribution for the covariates and implement a full likelihood approach to inference to estimate the capture function, the covariate probability distribution, and the population size. The general Jolly-Seber open model enables one to estimate capture probability, population sizes, survival rates, and birth numbers. 
Much of the focus on modelling covariates in program MARK has been for survival and capture probability in the Cormack-Jolly-Seber model and its generalizations (including tag-return models). These models condition on the number of animals marked and released. A related, but distinct, topic is radio telemetry survival modelling that typically uses a modified Kaplan-Meier method and Cox proportional hazards model for auxiliary variables. Recently there has been an emphasis on integration of recruitment in the likelihood, and research on how to implement covariate modelling for recruitment and perhaps population size is needed. The combined open and closed 'robust' design model can also benefit from covariate modelling and some important options have already been implemented into MARK. Many models are usually fitted to one data set. This has necessitated development of model selection criteria based on the AIC (Akaike Information Criterion) and the alternative of averaging over reasonable models. The special problems of estimating over-dispersion when covariates are included in the model and then adjusting for over-dispersion in model selection could benefit from further research.
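The Huggins-Alho conditional approach can be sketched as a generalized Horvitz-Thompson sum over the captured animals. In the sketch below the logistic capture model and its coefficients are hypothetical, and the capture probabilities are treated as known rather than estimated, to isolate the estimator itself:

```python
import math
import random

def capture_prob(x, b0=-1.0, b1=1.5):
    """Hypothetical logistic capture probability as a function of an individual covariate x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

rng = random.Random(42)
N_true = 2000
covariates = [rng.gauss(0.0, 1.0) for _ in range(N_true)]

# Simulate one capture occasion, keeping only the covariates of captured animals
# (the Huggins-Alho approach conditions on exactly this observed subset).
captured = [x for x in covariates if rng.random() < capture_prob(x)]

# Generalized Horvitz-Thompson: each captured animal stands in for 1/p(x) animals.
N_hat = sum(1.0 / capture_prob(x) for x in captured)
```

In practice the coefficients b0, b1 are first estimated from the capture histories by conditional likelihood, and the estimated probabilities are plugged into the same sum.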

9.
In this paper, we consider a Bayesian mixture model that allows us to integrate out the weights of the mixture in order to obtain a procedure in which the number of clusters is an unknown quantity. To determine clusters and estimate parameters of interest, we develop an MCMC algorithm called the sequential data-driven allocation sampler. In this algorithm, a single observation has a non-null probability to create a new cluster and a set of observations may create a new cluster through the split-merge movements. The split-merge movements are developed using a sequential allocation procedure based on allocation probabilities that are calculated according to the Kullback–Leibler divergence between the posterior distribution using the observations previously allocated and the posterior distribution including a ‘new’ observation. We verified the performance of the proposed algorithm on simulated data and then we illustrate its use on three publicly available real data sets.

10.
Dead recoveries of marked animals are commonly used to estimate survival probabilities. Band‐recovery models can be parameterized either by r (the probability of recovering a band conditional on death of the animal) or by f (the probability that an animal will be killed, retrieved, and have its band reported). The r parametrization can be implemented in a capture‐recapture framework with two states (alive and newly dead), mortality being the transition probability between the two states. The authors show here that the f parametrization can also be implemented in a multistate framework by imposing simple constraints on some parameters. They illustrate it using data on the mallard and the snow goose. However, they mention that because it does not entirely separate the individual survival and encounter processes, the f parametrization must be used with care on reduced models, or in the presence of estimates at the boundary of the parameter space. As they show, a multistate framework allows the use of powerful software for model fitting or testing the goodness‐of‐fit of models; it also affords the implementation of complex models such as those based on mixture of information or uncertain states.
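The two parameterizations are linked by f = r(1 - S), where S is the survival probability: an animal must first die, and then be recovered and reported. A trivial sketch of the conversion (the relation is standard in the band-recovery literature; the numeric values are illustrative):

```python
def f_from_r(r, S):
    """Band-recovery rate f from the conditional recovery rate r:
    die first (prob 1 - S), then be retrieved and reported (prob r)."""
    return (1.0 - S) * r

def r_from_f(f, S):
    """Inverse mapping; undefined at S = 1, where there is no mortality to recover from."""
    return f / (1.0 - S)

f = f_from_r(r=0.2, S=0.6)   # 0.2 * (1 - 0.6) = 0.08
```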

12.
Summary. Models for multiple-test screening data generally require the assumption that the tests are independent conditional on disease state. This assumption may be unreasonable, especially when the biological basis of the tests is the same. We propose a model that allows for correlation between two diagnostic test results. Since models that incorporate test correlation involve more parameters than can be estimated with the available data, posterior inferences will depend more heavily on prior distributions, even with large sample sizes. If we have reasonably accurate information about one of the two screening tests (perhaps the standard currently used test) or the prevalences of the populations tested, accurate inferences about all the parameters, including the test correlation, are possible. We present a model for evaluating dependent diagnostic tests and analyse real and simulated data sets. Our analysis shows that, when the tests are correlated, a model that assumes conditional independence can perform very poorly. We recommend that, if the tests are only moderately accurate and measure the same biological responses, researchers use the dependence model for their analyses.
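One common way to introduce pairwise dependence is to add conditional covariance terms to the concordant outcome cells and subtract them from the discordant ones; this specific parameterization is an assumption for illustration, not necessarily the authors' exact model:

```python
def cell_probs(prev, se1, sp1, se2, sp2, cov_p=0.0, cov_n=0.0):
    """Marginal probabilities of the four joint outcomes (T1, T2) of two tests with
    sensitivities se1, se2 and specificities sp1, sp2, allowing conditional
    covariances cov_p (diseased) and cov_n (non-diseased) between the tests.
    cov_p = cov_n = 0 recovers the conditional-independence model."""
    probs = {}
    for t1 in (0, 1):
        for t2 in (0, 1):
            s1 = se1 if t1 else 1 - se1          # P(T1 = t1 | diseased), marginally
            s2 = se2 if t2 else 1 - se2
            c1 = 1 - sp1 if t1 else sp1          # P(T1 = t1 | non-diseased), marginally
            c2 = 1 - sp2 if t2 else sp2
            sign = 1 if t1 == t2 else -1         # covariance inflates concordant cells
            probs[(t1, t2)] = (prev * (s1 * s2 + sign * cov_p)
                               + (1 - prev) * (c1 * c2 + sign * cov_n))
    return probs

indep = cell_probs(0.1, 0.9, 0.95, 0.8, 0.9)
dep = cell_probs(0.1, 0.9, 0.95, 0.8, 0.9, cov_p=0.05, cov_n=0.01)
```

Comparing `indep` and `dep` shows how even modest covariances shift the double-positive cell, which is why a conditional-independence analysis of correlated tests can be badly biased.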

13.
Independent factor analysis (IFA) has recently been proposed in the signal processing literature as a way to model a set of observed variables through linear combinations of latent independent variables and a noise term. A peculiarity of the method is that it defines a probability density function for the latent variables by mixtures of Gaussians. The aim of this paper is to cast the method into a more rigorous statistical framework and to propose some developments. In the first part, we present the IFA model in its population version, address identifiability issues and draw some parallels between the IFA model and the ordinary factor analysis (FA) one. Then we show that the IFA model may be reinterpreted as an independent component analysis-based rotation of an ordinary FA solution. We also give evidence that the IFA model represents a special case of mixture of factor analysers. In the second part, we address inferential issues, also deriving the standard errors for the model parameter estimates and providing model selection criteria. Finally, we present some empirical results on real data sets.

14.
Summary. Cohort studies of individuals infected with the human immunodeficiency virus (HIV) provide useful information on the past pattern of HIV diagnoses, progression of the disease and use of antiretroviral therapy. We propose a new method for using individual data from an open prevalent cohort study to estimate the incidence of HIV, by jointly modelling the HIV diagnosis, the inclusion in the cohort and the progression of the disease in a Markov model framework. The estimation procedure involves the construction of a likelihood function which takes into account the probability of observing the total number of subjects who are enrolled in the cohort and the probabilities of passage through the stages of disease for each observed subject conditionally on being included in the cohort. The estimator of the HIV infection rate is defined as the function which maximizes a penalized likelihood, and the solution of this maximization problem is approximated on a basis of cubic M-splines. The method is illustrated by using cohort data from a hospital-based surveillance system of HIV infection in Aquitaine, a region of south-western France. A simulation study is performed to study the ability of the model to reconstruct the incidence of HIV from prevalent cohort data.

15.
We propose a semiparametric modeling approach for mixtures of symmetric distributions. The mixture model is built from a common symmetric density with different components arising through different location parameters. This structure ensures identifiability for mixture components, which is a key feature of the model as it allows applications to settings where primary interest is inference for the subpopulations comprising the mixture. We focus on the two-component mixture setting and develop a Bayesian model using parametric priors for the location parameters and for the mixture proportion, and a nonparametric prior probability model, based on Dirichlet process mixtures, for the random symmetric density. We present an approach to inference using Markov chain Monte Carlo posterior simulation. The performance of the model is studied with a simulation experiment and through analysis of a rainfall precipitation data set as well as with data on eruptions of the Old Faithful geyser.

16.
We address the problem of estimating the proportions of two statistical populations in a given mixture on the basis of an unlabeled sample of n-dimensional observations on the mixture. Assuming that the expected values of observations on the two populations are known, we show that almost any linear map from R^n to R^1 yields an unbiased consistent estimate of the proportion of one population in a very easy way. We then find that linear map for which the resulting proportion estimate has minimum variance among all estimates so obtained. After deriving a simple expression for the minimum-variance estimate, we discuss practical aspects of obtaining this and related estimates.
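The unbiasedness argument turns directly into an estimator: for any linear map a with a·(m1 - m2) ≠ 0, E[a·x] = p a·m1 + (1 - p) a·m2, so solving for p and replacing the expectation with a sample mean gives an unbiased, consistent estimate. A sketch with an arbitrary (not the paper's minimum-variance) choice of a; the data and names are illustrative:

```python
import numpy as np

def proportion_estimate(X, m1, m2, a):
    """Unbiased estimate of the proportion p of population 1 in the mixture,
    from any linear map a with a.(m1 - m2) != 0:
    E[a.x] = p a.m1 + (1-p) a.m2  =>  p = (E[a.x] - a.m2) / (a.(m1 - m2))."""
    projected = X @ a
    return float((projected.mean() - a @ m2) / (a @ (m1 - m2)))

rng = np.random.default_rng(3)
p_true, n_obs = 0.3, 5000
m1, m2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])   # known population means

# Simulate a mixture sample with unit-variance Gaussian components.
labels = rng.random(n_obs) < p_true
X = np.where(labels[:, None],
             rng.normal(m1, 1.0, (n_obs, 2)),
             rng.normal(m2, 1.0, (n_obs, 2)))

p_hat = proportion_estimate(X, m1, m2, a=m1 - m2)     # one convenient choice of a
```

The paper's contribution is then choosing a to minimize the variance of this estimate; any other a with a·(m1 - m2) ≠ 0 still gives an unbiased answer, just a noisier one.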

17.
In this article, we use a latent class model (LCM) with prevalence modeled as a function of covariates to assess diagnostic test accuracy in situations where the true disease status is not observed, but observations on three or more conditionally independent diagnostic tests are available. A fast Monte Carlo expectation–maximization (MCEM) algorithm with binary (disease) diagnostic data is implemented to estimate parameters of interest; namely, sensitivity, specificity, and prevalence of the disease as a function of covariates. To obtain standard errors for confidence interval construction of estimated parameters, the missing information principle is applied to adjust information matrix estimates. We compare the adjusted information matrix-based standard error estimates with the bootstrap standard error estimates both obtained using the fast MCEM algorithm through an extensive Monte Carlo study. Simulation demonstrates that the adjusted information matrix approach estimates the standard error similarly with the bootstrap methods under certain scenarios. The bootstrap percentile intervals have satisfactory coverage probabilities. We then apply the LCM analysis to a real data set of 122 subjects from a Gynecologic Oncology Group study of significant cervical lesion diagnosis in women with atypical glandular cells of undetermined significance to compare the diagnostic accuracy of a histology-based evaluation, a carbonic anhydrase-IX biomarker-based test and a human papillomavirus DNA test.

18.
In unsupervised classification, hidden Markov models (HMMs) are used to account for a neighborhood structure between observations. The emission distributions are often supposed to belong to some parametric family. In this paper, a semiparametric model where the emission distributions are a mixture of parametric distributions is proposed to get a higher flexibility. We show that the standard EM algorithm can be adapted to infer the model parameters. For the initialization step, starting from a large number of components, a hierarchical method to combine them into the hidden states is proposed. Three likelihood-based criteria to select the components to be combined are discussed. To estimate the number of hidden states, BIC-like criteria are derived. A simulation study is carried out both to determine the best combination between the combining criteria and the model selection criteria and to evaluate the accuracy of classification. The proposed method is also illustrated using a biological dataset from the model plant Arabidopsis thaliana. An R package, HMMmix, is freely available on CRAN.
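A minimal sketch of evaluating such a model: the scaled forward recursion for an HMM whose per-state emission densities are Gaussian mixtures (toy parameters; this illustrates likelihood evaluation only, not the paper's estimation or component-combining procedure):

```python
import numpy as np

def forward_loglik(obs, trans, init, emis_w, emis_mu, emis_sd):
    """Log-likelihood of a 1-D observation sequence under an HMM whose state-wise
    emission densities are Gaussian mixtures: emis_w[s], emis_mu[s], emis_sd[s]
    hold the component weights, means and standard deviations for state s."""
    def emit(s, x):
        w, mu, sd = emis_w[s], emis_mu[s], emis_sd[s]
        return float(np.sum(w * np.exp(-0.5 * ((x - mu) / sd) ** 2)
                            / (sd * np.sqrt(2.0 * np.pi))))

    n_states = len(init)
    alpha = np.array([init[s] * emit(s, obs[0]) for s in range(n_states)])
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c                       # rescale to avoid underflow
    for x in obs[1:]:
        alpha = np.array([emit(s, x) for s in range(n_states)]) * (alpha @ trans)
        c = alpha.sum()
        loglik += np.log(c)                 # accumulate the log scale factors
        alpha = alpha / c
    return float(loglik)

trans = np.array([[0.9, 0.1], [0.2, 0.8]])
init = np.array([0.5, 0.5])
emis_w = [np.array([0.6, 0.4]), np.array([1.0])]     # state 0 emits a two-component mixture
emis_mu = [np.array([-2.0, 0.0]), np.array([3.0])]
emis_sd = [np.array([1.0, 0.5]), np.array([1.0])]
obs = [-2.1, -1.9, 0.1, 3.2, 2.8]
ll = forward_loglik(obs, trans, init, emis_w, emis_mu, emis_sd)
```

The adapted EM algorithm in the paper wraps an E-step built on exactly this kind of forward (and backward) pass.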

19.
In the framework of model-based cluster analysis, finite mixtures of Gaussian components represent an important class of statistical models widely employed for dealing with quantitative variables. Within this class, we propose novel models in which constraints on the component-specific variance matrices allow us to define Gaussian parsimonious clustering models. Specifically, the proposed models are obtained by assuming that the variables can be partitioned into groups that are conditionally independent within components, thus producing component-specific variance matrices with a block diagonal structure. This approach allows us to extend the methods for model-based cluster analysis and to make them more flexible and versatile. In this paper, Gaussian mixture models are studied under the above mentioned assumption. Identifiability conditions are proved and the model parameters are estimated through the maximum likelihood method by using the Expectation-Maximization algorithm. The Bayesian information criterion is proposed for selecting the partition of the variables into conditionally independent groups. The consistency of the use of this criterion is proved under regularity conditions. In order to examine and compare models with different partitions of the set of variables, a hierarchical algorithm is suggested. A wide class of parsimonious Gaussian models is also presented by parameterizing the component-variance matrices according to their spectral decomposition. The effectiveness and usefulness of the proposed methodology are illustrated with two examples based on real datasets.
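The parsimony gain is easy to quantify: a block-diagonal structure with group sizes d1, ..., dB needs sum of d_b(d_b + 1)/2 free covariance parameters per component instead of d(d + 1)/2 for an unconstrained matrix. A small sketch (illustrative only):

```python
import numpy as np

def cov_param_count(block_sizes):
    """Free covariance parameters per component for a block-diagonal variance matrix
    whose conditionally independent variable groups have the given sizes."""
    return sum(b * (b + 1) // 2 for b in block_sizes)

def block_diag_cov(block_covs):
    """Assemble a block-diagonal variance matrix from per-group covariance blocks."""
    d = sum(b.shape[0] for b in block_covs)
    out = np.zeros((d, d))
    i = 0
    for b in block_covs:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

full = cov_param_count([6])                 # unconstrained 6x6 matrix: 21 parameters
parsimonious = cov_param_count([2, 2, 2])   # three independent pairs: 9 parameters
C = block_diag_cov([np.eye(2), 2.0 * np.eye(3)])   # 5x5, zeros off the blocks
```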

20.
Procedures are derived for selecting, with controlled probability of error, (1) a subset of populations which contains all populations better than a dual probability/proportion standard and (2) a subset of populations which both contains all populations better than an upper probability/proportion standard and also contains no populations worse than a lower probability/proportion standard. The procedures are motivated by current investigations in the area of computer performance evaluation.

Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)