
1.

Mixture of latent trait analyzers for model-based clustering of categorical data

Isabella Gollini Thomas Brendan Murphy 《Statistics and Computing》2014,24(4):569-588

Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone.

2.

Latent class based multiple imputation approach for missing categorical data

Mulugeta Gebregziabher Stacia M. DeSantis 《Journal of statistical planning and inference》2010

In this paper we propose a latent class based multiple imputation approach for analyzing missing categorical covariate data in a highly stratified data model. In this approach, we impute the missing data assuming a latent class imputation model and we use likelihood methods to analyze the imputed data. Via extensive simulations, we study its statistical properties and make comparisons with complete case analysis, multiple imputation, saturated log-linear multiple imputation and the Expectation–Maximization approach under seven missing data mechanisms (including missing completely at random, missing at random and not missing at random). These methods are compared with respect to bias, asymptotic standard error, type I error, and 95% coverage probabilities of parameter estimates. Simulations show that, under many missingness scenarios, latent class multiple imputation performs favorably when jointly considering these criteria. A data example from a matched case–control study of the association between multiple myeloma and polymorphisms of the Inter-Leukin 6 genes is considered.

3.

A likelihood ratio test of a homoscedastic normal mixture against a heteroscedastic normal mixture

Yungtai Lo 《Statistics and Computing》2008,18(3):233-240

It is generally assumed that the likelihood ratio statistic for testing the null hypothesis that data arise from a homoscedastic normal mixture distribution versus the alternative hypothesis that data arise from a heteroscedastic normal mixture distribution has an asymptotic χ² reference distribution with degrees of freedom equal to the difference in the number of parameters being estimated under the alternative and null models under some regularity conditions. Simulations show that the χ² reference distribution will give a reasonable approximation for the likelihood ratio test only when the sample size is 2000 or more and the mixture components are well separated when the restrictions suggested by Hathaway (Ann. Stat. 13:795–800, 1985) are imposed on the component variances to ensure that the likelihood is bounded under the alternative distribution. For small and medium sample sizes, parametric bootstrap tests appear to work well for determining whether data arise from a normal mixture with equal variances or a normal mixture with unequal variances.

4.

Model-based classification using latent Gaussian mixture models

Paul D. McNicholas 《Journal of statistical planning and inference》2010

A novel model-based classification technique is introduced based on parsimonious Gaussian mixture models (PGMMs). PGMMs, which were introduced recently as a model-based clustering technique, arise from a generalization of the mixtures of factor analyzers model and are based on a latent Gaussian mixture model. In this paper, this mixture modelling structure is used for model-based classification and the particular area of application is food authenticity. Model-based classification is performed by jointly modelling data with known and unknown group memberships within a likelihood framework and then estimating parameters, including the unknown group memberships, within an alternating expectation-conditional maximization framework. Model selection is carried out using the Bayesian information criterion and the quality of the maximum a posteriori classifications is summarized using the misclassification rate and the adjusted Rand index. This new model-based classification technique gives excellent classification performance when applied to real food authenticity data on the chemical properties of olive oils from nine areas of Italy.
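The two summary measures named in this abstract can be sketched in a few lines. The following is a minimal illustration and not the author's code: the function names `adjusted_rand_index` and `misclassification_rate` are hypothetical, and the misclassification rate assumes the predicted labels have already been matched to the true group labels.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Pair counts within contingency-table cells, rows and columns.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # expected index under chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single group
    return (sum_cells - expected) / (max_index - expected)


def misclassification_rate(labels_true, labels_pred):
    """Share of items whose predicted label differs (labels assumed aligned)."""
    wrong = sum(t != p for t, p in zip(labels_true, labels_pred))
    return wrong / len(labels_true)


# Relabelled but identical partitions agree perfectly.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that cut across each other score below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
rate = misclassification_rate([0, 0, 1, 1], [0, 1, 1, 1])
```

The adjusted Rand index is invariant to relabelling of the groups, which is why it is convenient for clustering output, whereas the misclassification rate is only meaningful once labels are aligned.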

5.

The Multi-Sample Block-Scalar Sphericity Test: Exact and Near-Exact Distributions for Its Likelihood Ratio Test Statistic

Carlos A. Coelho Filipe J. Marques 《Communications in Statistics - Theory and Methods》2013,42(7):1153-1175

In this article the authors show how by adequately decomposing the null hypothesis of the multi-sample block-scalar sphericity test it is possible to obtain the likelihood ratio test statistic as well as a different look over its exact distribution. This enables the construction of well-performing near-exact approximations for the distribution of the test statistic, whose exact distribution is quite elaborate and non-manageable. The near-exact distributions obtained are manageable and perform much better than the available asymptotic distributions, even for small sample sizes, and they show a good asymptotic behavior for increasing sample sizes as well as for increasing number of variables and/or populations involved.

6.

GAMMA-BASED CLUSTERING VIA ORDERED MEANS WITH APPLICATION TO GENE-EXPRESSION ANALYSIS

Newton MA Chung LM 《Annals of statistics》2010,38(6):3217-3244

Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.

7.

Tests for assessment of agreement using probability criteria

Pankaj K. Choudhary H.N. Nagaraja 《Journal of statistical planning and inference》2007

For the assessment of agreement using probability criteria, we obtain an exact test, and for sample sizes exceeding 30, we give a bootstrap-t test that is remarkably accurate. We show that for assessing agreement, the total deviation index approach of Lin [2000. Total deviation index for measuring individual agreement with applications in laboratory performance and bioequivalence. Statist. Med. 19, 255–270] is not consistent and may not preserve its asymptotic nominal level, and that the coverage probability approach of Lin et al. [2002. Statistical methods in assessing agreement: models, issues and tools. J. Amer. Statist. Assoc. 97, 257–270] is overly conservative for moderate sample sizes. We also show that the nearly unbiased test of Wang and Hwang [2001. A nearly unbiased test for individual bioequivalence problems using probability criteria. J. Statist. Plann. Inference 99, 41–58] may be liberal for large sample sizes, and suggest a minor modification that gives numerically equivalent approximation to the exact test for sample sizes 30 or less. We present a simple and accurate sample size formula for planning studies on assessing agreement, and illustrate our methodology with a real data set from the literature.

8.

Distributions of LRT associated with two-parameter exponential distributions

B. Nagarsenker P.B. Nagarsenker 《Communications in Statistics - Theory and Methods》2013,42(14):1583-1593

In this paper, an exact distribution of the likelihood ratio criterion for testing the equality of p two-parameter exponential distributions is obtained for unequal sample sizes in a computational form. A useful asymptotic expansion of the distribution is also obtained up to the order of n^-4 with the second term of the order of n^-3, and so can be used to obtain accurate approximations to the critical values of the test statistic even for comparatively small values of n, where n is the combined sample size. In fact, the first term alone, which is a single beta distribution, provides a powerful approximation for moderately large values of n.

9.

Application of a predictive distribution formula to Bayesian computation for incomplete data models

Trevor Sweeting Samer Kharroubi 《Statistics and Computing》2005,15(3):167-178

We consider exact and approximate Bayesian computation in the presence of latent variables or missing data. Specifically we explore the application of a posterior predictive distribution formula derived in Sweeting and Kharroubi (2003), which is a particular form of Laplace approximation, both as an importance function and a proposal distribution. We show that this formula provides a stable importance function for use within poor man’s data augmentation schemes and that it can also be used as a proposal distribution within a Metropolis-Hastings algorithm for models that are not analytically tractable. We illustrate both uses in the case of a censored regression model and a normal hierarchical model, with both normal and Student t distributed random effects. Although the predictive distribution formula is motivated by regular asymptotic theory, it is not necessary that the likelihood has a closed form or that it possesses a local maximum.

10.

Computing highly accurate confidence limits from discrete data using importance sampling

Chris J. Lloyd Degui Li 《Statistics and Computing》2014,24(4):663-673

For discrete data, frequentist confidence limits based on a normal approximation to standard likelihood based pivotal quantities can perform poorly, even for quite large sample sizes. Constructing exact limits requires the probability of a suitable tail set as a function of the unknown parameters. In this paper, importance sampling is used to estimate this surface and hence the confidence limits. The technology is simple and straightforward to implement. Unlike the recent methodology of Garthwaite and Jones (in J. Comput. Graph. Stat. 18, 184–200, 2009), the new method allows for nuisance parameters; is an order of magnitude more efficient than the Robbins-Monro bound; does not require any simulation phases or tuning constants; gives a straightforward simulation standard error for the target limit; and includes a simple diagnostic for simulation breakdown.
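The basic ingredient described in this abstract, estimating a tail probability by importance sampling together with a simulation standard error, can be illustrated on a toy case. This is a generic sketch and not the authors' procedure: it assumes a single binomial parameter with no nuisance parameters, and the names `binomial_tail_is`, `binomial_tail_exact` and the proposal probability `q` are hypothetical choices made for illustration.

```python
import math
import random


def binomial_tail_exact(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p), used here only as a check."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def binomial_tail_is(n, p, k, q, draws, rng):
    """Estimate P_p(X >= k) from draws of X ~ Binomial(n, q).

    Returns the importance sampling estimate and its simulation standard error.
    """
    terms = []
    for _ in range(draws):
        x = sum(rng.random() < q for _ in range(n))  # one Binomial(n, q) draw
        if x >= k:
            # Likelihood ratio of the target (p) to the proposal (q).
            log_w = x * math.log(p / q) + (n - x) * math.log((1 - p) / (1 - q))
            terms.append(math.exp(log_w))
        else:
            terms.append(0.0)
    mean = sum(terms) / draws
    variance = sum((t - mean) ** 2 for t in terms) / (draws - 1)
    return mean, math.sqrt(variance / draws)


rng = random.Random(0)
# Proposal centred on the tail boundary k / n so that tail events are common.
estimate, std_error = binomial_tail_is(n=30, p=0.1, k=10, q=10 / 30, draws=20000, rng=rng)
exact = binomial_tail_exact(30, 0.1, 10)
```

Because the weights depend on `p` only through the likelihood ratio, the same draws could be reweighted for other values of `p`, which is one way to trace out the tail probability as a function of the parameter.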

11.

Model-based clustering, classification, and discriminant analysis of data with mixed type

Ryan P. Browne Paul D. McNicholas 《Journal of statistical planning and inference》2012

We propose a mixture of latent variables model for the model-based clustering, classification, and discriminant analysis of data comprising variables with mixed type. This approach is a generalization of latent variable analysis, and model fitting is carried out within the expectation-maximization framework. Our approach is outlined and a simulation study is conducted to illustrate the effect of sample size and noise on the standard errors and the recovery probabilities for the number of groups. Our modelling methodology is then applied to two real data sets and their clustering and classification performance is discussed. We conclude with discussion and suggestions for future work.

12.

The effects of different choices of order for autoregressive approximation on the Gaussian likelihood estimates for ARMA models

M. O. Salau 《Statistical Papers》2003,44(1):89-105

This paper investigates, by means of Monte Carlo simulation, the effects of different choices of order for autoregressive approximation on the fully efficient parameter estimates for autoregressive moving average models. Four order selection criteria, AIC, BIC, HQ and PKK, were compared and different model structures with varying sample sizes were used to contrast the performance of the criteria. Some asymptotic results which provide a useful guide for assessing the performance of these criteria are presented. The results of this comparison show that there are marked differences in the accuracy implied using these alternative criteria in small sample situations and that it is preferable to apply the BIC criterion, which leads to greater precision of Gaussian likelihood estimates, in such cases. Implications of the findings of this study for the estimation of time series models are highlighted.
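Order selection for an autoregressive fit can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's simulation design: it fits pure AR models by the Levinson-Durbin recursion on sample autocovariances, uses the common forms n·log(σ²) + penalty·k of AIC, BIC and HQ, omits PKK, and simulates an AR(2) series with coefficients chosen only for the demonstration. The function names are hypothetical.

```python
import math
import random


def innovation_variances(series, max_order):
    """Innovation variance of AR(k) fits, k = 0..max_order, via Levinson-Durbin."""
    n = len(series)
    mean = sum(series) / n
    x = [v - mean for v in series]
    acov = [sum(x[t] * x[t - lag] for t in range(lag, n)) / n
            for lag in range(max_order + 1)]
    variances = [acov[0]]
    phi = []  # coefficients of the current AR(k - 1) fit
    for k in range(1, max_order + 1):
        numerator = acov[k] - sum(phi[j] * acov[k - 1 - j] for j in range(k - 1))
        reflection = numerator / variances[-1]  # partial autocorrelation at lag k
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)] + [reflection]
        variances.append(variances[-1] * (1.0 - reflection**2))
    return variances


def select_order(series, max_order):
    """Return the order minimising each criterion of the form n*log(var) + penalty*k."""
    n = len(series)
    variances = innovation_variances(series, max_order)
    penalties = {"AIC": 2.0, "BIC": math.log(n), "HQ": 2.0 * math.log(math.log(n))}
    return {
        name: min(range(max_order + 1),
                  key=lambda k: n * math.log(variances[k]) + penalty * k)
        for name, penalty in penalties.items()
    }


# Simulate an AR(2) series: x_t = 0.5 x_{t-1} - 0.5 x_{t-2} + e_t (assumed values).
rng = random.Random(1)
values = [0.0, 0.0]
for _ in range(2200):
    values.append(0.5 * values[-1] - 0.5 * values[-2] + rng.gauss(0.0, 1.0))
series = values[202:]  # drop the start-up values and a burn-in period
orders = select_order(series, max_order=8)
```

Because the BIC penalty per parameter exceeds the HQ penalty, which in turn exceeds the AIC penalty at this sample size, the selected orders are nested: BIC never picks a larger order than HQ, and HQ never a larger one than AIC.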

13.

ROBUSTNESS PROPERTIES OF LOGNORMAL CONFIDENCE INTERVALS FOR LOGNORMAL AND GAMMA DISTRIBUTED DATA

《Communications in Statistics - Theory and Methods》2013,42(11):1939-1957

ABSTRACT

The performances of six confidence intervals for estimating the arithmetic mean of a lognormal distribution are compared using simulated data. The first interval considered is based on an exact method and is recommended in U.S. EPA guidance documents for calculating upper confidence limits for contamination data. Two intervals are based on asymptotic properties due to the Central Limit Theorem, and the other three are based on transformations and maximum likelihood estimation. The effects of departures from lognormality on the performance of these intervals are also investigated. The gamma distribution is considered to represent departures from the lognormal distribution. The average width and coverage of each confidence interval are reported for varying mean, variance, and sample size. In the lognormal case, the exact interval gives good coverage, but for small sample sizes and large variances the confidence intervals are too wide. In these cases, an approximation that incorporates sampling variability of the sample variance tends to perform better. When the underlying distribution is a gamma distribution, the intervals based upon the Central Limit Theorem tend to perform better than those based upon lognormal assumptions.

14.

Asymptotic approximations for the distributions of the Kφ-divergence goodness-of-fit statistics

T. Pérez J. A. Pardo 《Statistical Papers》2003,44(3):349-366

Under the null hypothesis, the family of Kφ-divergence goodness-of-fit statistics has an asymptotic chi-squared distribution; however, for small samples, the chi-squared approximation in some cases does not agree well with the exact distribution. In this paper, a closer approximation to the exact distribution is obtained by extracting the φ-dependent second order component from the distribution. Moreover, numerical results are presented for moderate sample sizes with a moderate number of cells.

15.

Large sample interval mapping method for genetic trait loci in finite regression mixture models

Hong Zhang Hanfeng Chen Zhaohai Li 《Journal of statistical planning and inference》2009

This article investigates the large sample interval mapping method for genetic trait loci (GTL) in a finite non-linear regression mixture model. The general model includes most commonly used kernel functions, such as exponential family mixture, logistic regression mixture and generalized linear mixture models, as special cases. The populations derived from either the backcross or intercross design are considered. In particular, unlike all existing results in the literature on finite mixture models, the large sample results presented in this paper do not require the boundedness condition on the parameter space. Therefore, the large sample theory presented in this article possesses general applicability to the interval mapping method of GTL in genetic research. The limiting null distribution of the likelihood ratio test statistics can be utilized easily to determine the threshold values or p-values required in the interval mapping. The limiting distribution is proved to be free of the parameter values of the null model and free of the choice of a kernel function. Extension to the multiple marker interval GTL detection is also discussed. Simulation study results show favorable performance of the asymptotic procedure when sample sizes are moderate.

16.

Unsupervised learning of regression mixture models with unknown number of components

《Journal of Statistical Computation and Simulation》2012,82(12):2308-2334

ABSTRACT

We propose a new unsupervised learning algorithm to fit regression mixture models with unknown number of components. The developed approach consists in a penalized maximum likelihood estimation carried out by a robust expectation–maximization (EM)-like algorithm. We derive it for polynomial, spline, and B-spline regression mixtures. The proposed learning approach is unsupervised: (i) it simultaneously infers the model parameters and the optimal number of the regression mixture components from the data as the learning proceeds, rather than in a two-fold scheme as in standard model-based clustering, which applies model selection criteria afterward, and (ii) it does not require accurate initialization unlike the standard EM for regression mixtures. The developed approach is applied to curve clustering problems. Numerical experiments on simulated and real data show that the proposed algorithm performs well and provides accurate clustering results, and confirm its benefit for practical applications.

17.

Asymptotics for tests on mean profiles, additional information and dimensionality under non-normality

Solomon W. Harrar 《Journal of statistical planning and inference》2009

We consider the comparison of mean vectors for k groups when k is large and sample size per group is fixed. The asymptotic null and non-null distributions of the normal theory likelihood ratio, Lawley–Hotelling and Bartlett–Nanda–Pillai statistics are derived under general conditions. We extend the results to tests on the profiles of the mean vectors, tests for additional information (provided by a sub-vector of the responses over and beyond the remaining sub-vector of responses in separating the groups) and tests on the dimension of the hyperplane formed by the mean vectors. Our techniques are based on perturbation expansions and limit theorems applied to independent but non-identically distributed sequences of quadratic forms in random matrices. In all these four MANOVA problems, the asymptotic null and non-null distributions are normal. Both the null and non-null distributions are asymptotically invariant to non-normality when the group sample sizes are equal. In the unbalanced case, a slight modification of the test statistics will lead to asymptotically robust tests. Based on the robustness results, some approaches for finite approximation are introduced. The numerical results provide strong support for the asymptotic results and finiteness approximations.

18.

Sample size calculations for single-arm survival studies using transformations of the Kaplan–Meier estimator

Kengo Nagashima Hisashi Noma Yasunori Sato Masahiko Gosho 《Pharmaceutical statistics》2021,20(3):499-511

In single-arm clinical trials with survival outcomes, the Kaplan–Meier estimator and its confidence interval are widely used to assess survival probability and median survival time. Since the asymptotic normality of the Kaplan–Meier estimator is a common result, the sample size calculation methods have not been studied in depth. An existing sample size calculation method is founded on the asymptotic normality of the Kaplan–Meier estimator using the log transformation. However, the log-transformed estimator performs quite poorly in small samples (which are typical in single-arm trials), and the existing method uses an inappropriate standard normal approximation to calculate sample sizes. These issues can seriously influence the accuracy of results. In this paper, we propose alternative methods to determine sample sizes based on a valid standard normal approximation with several transformations that may give an accurate normal approximation even with small sample sizes. In numerical evaluations via simulations, some of the proposed methods provided more accurate results, and the empirical power of the proposed method with the arcsine square-root transformation tended to be closer to a prescribed power than the other transformations. These results were supported when methods were applied to data from three clinical trials.
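The Kaplan–Meier estimator and an arcsine square-root confidence interval can be sketched as below. This is a minimal illustration using the standard textbook delta-method form with Greenwood's variance; it is not the sample size method proposed in the paper. The function names and the toy data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Return [(time, survival, greenwood_sum)] at each distinct event time.

    events[i] is 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival, greenwood = 1.0, 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:  # group tied times
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            if at_risk > deaths:
                greenwood += deaths / (at_risk * (at_risk - deaths))
            else:
                greenwood = math.inf  # survival has dropped to zero
            curve.append((t, survival, greenwood))
        at_risk -= removed
    return curve


def arcsine_sqrt_interval(survival, greenwood, z=1.959964):
    """Confidence interval for S(t) on the arcsine square-root scale."""
    if not 0.0 < survival < 1.0:
        return survival, survival  # transformation is degenerate at 0 and 1
    center = math.asin(math.sqrt(survival))
    # Delta method: se of asin(sqrt(S)) is 0.5 * sqrt(greenwood * S / (1 - S)).
    half_width = z * 0.5 * math.sqrt(greenwood * survival / (1.0 - survival))
    lower = math.sin(max(0.0, center - half_width)) ** 2
    upper = math.sin(min(math.pi / 2.0, center + half_width)) ** 2
    return lower, upper


# Toy data: ten subjects, with 1 marking an event and 0 a censored time.
times = [2, 3, 3, 5, 7, 8, 8, 9, 11, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
intervals = [arcsine_sqrt_interval(s, g) for _, s, g in curve]
```

Unlike the plain normal interval, the back-transformed limits always stay within [0, 1], which is part of the appeal of this transformation in small samples.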

19.

Estimation of generalized linear latent variable models

Philippe Huber Elvezio Ronchetti Maria-Pia Victoria-Feser 《Journal of the Royal Statistical Society. Series B, Statistical methodology》2004,66(4):893-908

Summary. Generalized linear latent variable models (GLLVMs), as defined by Bartholomew and Knott, enable modelling of relationships between manifest and latent variables. They extend structural equation modelling techniques, which are powerful tools in the social sciences. However, because of the complexity of the log-likelihood function of a GLLVM, an approximation such as numerical integration must be used for inference. This can drastically limit the number of variables in the model and can lead to biased estimators. We propose a new estimator for the parameters of a GLLVM, based on a Laplace approximation to the likelihood function and which can be computed even for models with a large number of variables. The new estimator can be viewed as an M-estimator, leading to readily available asymptotic properties and correct inference. A simulation study shows its excellent finite sample properties, in particular when compared with a well-established approach such as LISREL. A real data example on the measurement of wealth for the computation of multidimensional inequality is analysed to highlight the importance of the methodology.

20.

Analysis of a continuous-time proportional hazard model using discrete duration data

Keunkwan Ryu 《Econometric Reviews》1995,14(3):299-313

Many economic duration variables are available only up to intervals, and not up to exact points. However, continuous time duration models are conceptually superior to discrete ones. Hence, in duration analyses, one faces a situation with discrete data and a continuous model. This paper discusses (i) the asymptotic bias of a conventional approximation procedure in which a discrete duration is treated as an exact observation; and (ii) the efficiency of a correct maximum likelihood estimator which appropriately accounts for the discrete nature of the data.