期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Variable selection for model-based clustering using the integrated complete-data likelihood

Matthieu Marbac Mohammed Sedki 《Statistics and Computing》2017,27(4):1049-1063

Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between the clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term can suffer from criticisms. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion which involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g. stepwise method). Second, the algorithms are often computationally expensive because they need multiple calls of EM algorithms. Here we propose to use a new information criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood estimate and its maximization appears to be simple and computationally efficient. The original contribution of our approach is to perform the model selection without requiring any parameter estimation. Then, parameter inference is needed only for the unique selected model. This approach is used for the variable selection of a Gaussian mixture model with conditional independence assumed. The numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches for variable selection. The proposed approach is implemented in the R package VarSelLCM available on CRAN. 相似文献

2.

Mixture of latent trait analyzers for model-based clustering of categorical data

Isabella Gollini Thomas Brendan Murphy 《Statistics and Computing》2014,24(4):569-588

Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone. 相似文献

3.

ON THE COVERAGE PROBABILITY OF CONFIDENCE INTERVALS IN REGRESSION AFTER VARIABLE SELECTION 总被引：1，自引：1，他引：0

Paul Kabaila 《Australian & New Zealand Journal of Statistics》2005,47(4):549-562

This paper considers a linear regression model with regression parameter vector β. The parameter of interest is θ= a^Tβ where a is specified. When, as a first step, a data‐based variable selection (e.g. minimum Akaike information criterion) is used to select a model, it is common statistical practice to then carry out inference about θ, using the same data, based on the (false) assumption that the selected model had been provided a priori. The paper considers a confidence interval for θ with nominal coverage 1 ‐ α constructed on this (false) assumption, and calls this the naive 1 ‐ α confidence interval. The minimum coverage probability of this confidence interval can be calculated for simple variable selection procedures involving only a single variable. However, the kinds of variable selection procedures used in practice are typically much more complicated. For the real‐life data presented in this paper, there are 20 variables each of which is to be either included or not, leading to 2²⁰ different models. The coverage probability at any given value of the parameters provides an upper bound on the minimum coverage probability of the naive confidence interval. This paper derives a new Monte Carlo simulation estimator of the coverage probability, which uses conditioning for variance reduction. For these real‐life data, the gain in efficiency of this Monte Carlo simulation due to conditioning ranged from 2 to 6. The paper also presents a simple one‐dimensional search strategy for parameter values at which the coverage probability is relatively small. For these real‐life data, this search leads to parameter values for which the coverage probability of the naive 0.95 confidence interval is 0.79 for variable selection using the Akaike information criterion and 0.70 for variable selection using Bayes information criterion, showing that these confidence intervals are completely inadequate. 相似文献

4.

Finite Mixture of Generalized Semiparametric Models: Variable Selection via Penalized Estimation

Farzad Eskandari Ehsan Ormoz 《统计学通讯:模拟与计算》2016,45(10):3744-3759

Selection of the important variables is one of the most important model selection problems in statistical applications. In this article, we address variable selection in finite mixture of generalized semiparametric models. To overcome computational burden, we introduce a class of variable selection procedures for finite mixture of generalized semiparametric models using penalized approach for variable selection. Estimation of nonparametric component will be done via multivariate kernel regression. It is shown that the new method is consistent for variable selection and the performance of proposed method will be assessed via simulation. 相似文献

5.

Variable Selection for Naive Bayes Semisupervised Learning

Byoung-Jeong Choi Kwang-Rae Kim Kyu-Dong Cho Changyi Park 《统计学通讯:模拟与计算》2013,42(10):2702-2713

This article deals with a semisupervised learning based on naive Bayes assumption. A univariate Gaussian mixture density is used for continuous input variables whereas a histogram type density is adopted for discrete input variables. The EM algorithm is used for the computation of maximum likelihood estimators of parameters in the model when we fix the number of mixing components for each continuous input variable. We carry out a model selection for choosing a parsimonious model among various fitted models based on an information criterion. A common density method is proposed for the selection of significant input variables. Simulated and real datasets are used to illustrate the performance of the proposed method. 相似文献

6.

Model-based clustering of Gaussian copulas for mixed data 总被引：1，自引：0，他引：1

Matthieu Marbac Christophe Biernacki Vincent Vandewalle 《统计学通讯:理论与方法》2017,46(23):11635-11656

Clustering of mixed data is important yet challenging due to a shortage of conventional distributions for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Indeed copulas, and Gaussian copulas in particular, are powerful tools for easily modeling the distribution of multivariate variables. This model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by considering the intra-component dependencies in a similar way to the Gaussian mixture. Indeed, each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As an interesting by-product, this model generalizes many well-known approaches and provides tools for visualization based on its parameters. The Bayesian inference is achieved with a Metropolis-within-Gibbs sampler. The numerical experiments, on simulated and real data, illustrate the benefits of the proposed model: flexible and meaningful parameterization combined with visualization features. 相似文献

7.

Clustering gene expression time course data using mixtures of multivariate t-distributions

Paul D. McNicholas Sanjeena Subedi 《Journal of statistical planning and inference》2012,142(5):1114-1127

Clustering gene expression time course data is an important problem in bioinformatics because understanding which genes behave similarly can lead to the discovery of important biological information. Statistically, the problem of clustering time course data is a special case of the more general problem of clustering longitudinal data. In this paper, a very general and flexible model-based technique is used to cluster longitudinal data. Mixtures of multivariate t-distributions are utilized, with a linear model for the mean and a modified Cholesky-decomposed covariance structure. Constraints are placed upon the covariance structure, leading to a novel family of mixture models, including parsimonious models. In addition to model-based clustering, these models are also used for model-based classification, i.e., semi-supervised clustering. Parameters, including the component degrees of freedom, are estimated using an expectation-maximization algorithm and two different approaches to model selection are considered. The models are applied to simulated data to illustrate their efficacy; this includes a comparison with their Gaussian analogues—the use of these Gaussian analogues with a linear model for the mean is novel in itself. Our family of multivariate t mixture models is then applied to two real gene expression time course data sets and the results are discussed. We conclude with a summary, suggestions for future work, and a discussion about constraining the degrees of freedom parameter. 相似文献

8.

A Comparison of Bayesian and Frequentist Interval Estimators in Regression that Utilize Uncertain Prior Information

下载免费PDF全文

Paul Kabaila Gayan Dharmarathne 《Australian & New Zealand Journal of Statistics》2015,57(1):99-118

相似文献

9.

Curve prediction and clustering with mixtures of Gaussian process functional regression models

J. Q. Shi B. Wang 《Statistics and Computing》2008,18(3):267-283

Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714–723, 2007) proposed a Gaussian process functional regression (GPFR) model to model functional response curves with a set of functional covariates. Two main problems are addressed by their method: modelling nonlinear and nonparametric regression relationship and modelling covariance structure and mean structure simultaneously. The method gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data with ‘spatially’ indexed data, i.e., the heterogeneity is dependent on factors such as region and individual patient’s information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follows a Gaussian process functional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model. This higher-level model is dependent on the information related to each batch. This method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model has also been used for curve clustering, but focusing on the problem of clustering functional relationships between response curve and covariates, i.e. the clustering is based on the surface shape of the functional response against the set of functional covariates. The model is examined on simulated data and real data. 相似文献

10.

Identifiable finite mixtures of location models for clustering mixed-mode data

Willse Alan Boik Robert J. 《Statistics and Computing》1999,9(2):111-121

For clustering mixed categorical and continuous data, Lawrence and Krzanowski (1996) proposed a finite mixture model in which component densities conform to the location model. In the graphical models literature the location model is known as the homogeneous Conditional Gaussian model. In this paper it is shown that their model is not identifiable without imposing additional restrictions. Specifically, for g groups and m locations, (g!)m–1 distinct sets of parameter values (not including permutations of the group mixing parameters) produce the same likelihood function. Excessive shrinkage of parameter estimates in a simulation experiment reported by Lawrence and Krzanowski (1996) is shown to be an artifact of the model's non-identifiability. Identifiable finite mixture models can be obtained by imposing restrictions on the conditional means of the continuous variables. These new identified models are assessed in simulation experiments. The conditional mean structure of the continuous variables in the restricted location mixture models is similar to that in the underlying variable mixture models proposed by Everitt (1988), but the restricted location mixture models are more computationally tractable. 相似文献

11.

Dimension reduction for model-based clustering 总被引：1，自引：0，他引：1

Luca Scrucca 《Statistics and Computing》2010,20(4):471-484

We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The proposed method aims at reducing the dimensionality by identifying a set of linear combinations, ordered by importance as quantified by the associated eigenvalues, of the original features which capture most of the cluster structure contained in the data. Observations may then be projected onto such a reduced subspace, thus providing summary plots which help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The new constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets. 相似文献

12.

Model-based classification using latent Gaussian mixture models 总被引：1，自引：0，他引：1

Paul D. McNicholas 《Journal of statistical planning and inference》2010

A novel model-based classification technique is introduced based on parsimonious Gaussian mixture models (PGMMs). PGMMs, which were introduced recently as a model-based clustering technique, arise from a generalization of the mixtures of factor analyzers model and are based on a latent Gaussian mixture model. In this paper, this mixture modelling structure is used for model-based classification and the particular area of application is food authenticity. Model-based classification is performed by jointly modelling data with known and unknown group memberships within a likelihood framework and then estimating parameters, including the unknown group memberships, within an alternating expectation-conditional maximization framework. Model selection is carried out using the Bayesian information criteria and the quality of the maximum a posteriori classifications is summarized using the misclassification rate and the adjusted Rand index. This new model-based classification technique gives excellent classification performance when applied to real food authenticity data on the chemical properties of olive oils from nine areas of Italy. 相似文献

13.

BAYESIAN SUBSET SELECTION AND MODEL AVERAGING USING A CENTRED AND DISPERSED PRIOR FOR THE ERROR VARIANCE 总被引：1，自引：0，他引：1

Edward Cripps Robert Kohn David Nott 《Australian & New Zealand Journal of Statistics》2006,48(2):237-252

This article proposes a new data‐based prior distribution for the error variance in a Gaussian linear regression model, when the model is used for Bayesian variable selection and model averaging. For a given subset of variables in the model, this prior has a mode that is an unbiased estimator of the error variance but is suitably dispersed to make it uninformative relative to the marginal likelihood. The advantage of this empirical Bayes prior for the error variance is that it is centred and dispersed sensibly and avoids the arbitrary specification of hyperparameters. The performance of the new prior is compared to that of a prior proposed previously in the literature using several simulated examples and two loss functions. For each example our paper also reports results for the model that orthogonalizes the predictor variables before performing subset selection. A real example is also investigated. The empirical results suggest that for both the simulated and real data, the performance of the estimators based on the prior proposed in our article compares favourably with that of a prior used previously in the literature. 相似文献

14.

Variable selection in classification model via quadratic programming

Jun Huang Wei Wang 《统计学通讯:模拟与计算》2018,47(7):1922-1939

Variable selection is an important decision process in consumer credit scoring. However, with the rapid growth in credit industry, especially, after the rising of e-commerce, a huge amount of information on customer behavior is available to provide more informative implication of consumer credit scoring. In this study, a hybrid quadratic programming model is proposed for consumer credit scoring problems by variable selection. The proposed model is then solved with a bisection method based on Tabu search algorithm (BMTS), and the solution of this model provides alternative subsets of variables in different sizes. The final subset of variables used in consumer credit scoring model is selected based on both the size (number of variables in a subset) and predictive (classification) accuracy rate. Simulation studies are used to measure the performances of the proposed model, illustrating its effectiveness for simultaneous variable selection as well as classification. 相似文献

15.

Mixtures of Gaussian copula factor analyzers for clustering high dimensional data

《Journal of the Korean Statistical Society》2019,48(3):480-492

Mixtures of factor analyzers is a useful model-based clustering method which can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive to both diverse non-normalities of marginal variables and outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for clustering high-dimensional clustering. This model has two advantages; (1) it allows different marginal distributions to facilitate fitting flexibility of the mixture model, (2) it can avoid the curse of dimensionality by embedding the factor-analytic structure in the component-correlation matrices of the mixture distribution.An EM algorithm is developed for the fitting of MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution which fits best to the data. It is applied to both synthetic data and a microarray gene expression data for clustering and shows its better performance over several existing methods. 相似文献

16.

Goodness-of-fit test for Gaussian regression with block correlated errors

S. Huet 《Statistics》2015,49(2):239-266

We propose a procedure to test that the expectation of a Gaussian vector is linear against a nonparametric alternative. We consider the case where the covariance matrix of the observations has a block diagonal structure. This framework encompasses regression models with autocorrelated errors, heteroscedastic regression models, mixed-effects models and growth curves. Our procedure does not depend on any prior information about the alternative. We prove that the test is asymptotically of the nominal level and consistent. We characterize the set of vectors on which the test is powerful and prove the classical √log log (n)/n convergence rate over directional alternatives. We propose a bootstrap version of the test as an alternative to the initial one and provide a simulation study in order to evaluate both procedures for small sample sizes when the purpose is to test goodness of fit in a Gaussian mixed-effects model. Finally, we illustrate the procedures using a real data set. 相似文献

17.

Using conditional independence for parsimonious model-based Gaussian clustering

Giuliano Galimberti Gabriele Soffritti 《Statistics and Computing》2013,23(5):625-638

In the framework of model-based cluster analysis, finite mixtures of Gaussian components represent an important class of statistical models widely employed for dealing with quantitative variables. Within this class, we propose novel models in which constraints on the component-specific variance matrices allow us to define Gaussian parsimonious clustering models. Specifically, the proposed models are obtained by assuming that the variables can be partitioned into groups resulting to be conditionally independent within components, thus producing component-specific variance matrices with a block diagonal structure. This approach allows us to extend the methods for model-based cluster analysis and to make them more flexible and versatile. In this paper, Gaussian mixture models are studied under the above mentioned assumption. Identifiability conditions are proved and the model parameters are estimated through the maximum likelihood method by using the Expectation-Maximization algorithm. The Bayesian information criterion is proposed for selecting the partition of the variables into conditionally independent groups. The consistency of the use of this criterion is proved under regularity conditions. In order to examine and compare models with different partitions of the set of variables a hierarchical algorithm is suggested. A wide class of parsimonious Gaussian models is also presented by parameterizing the component-variance matrices according to their spectral decomposition. The effectiveness and usefulness of the proposed methodology are illustrated with two examples based on real datasets. 相似文献

18.

The false discovery rate: a variable selection perspective

《Journal of statistical planning and inference》2006,136(8):2668-2684

相似文献

19.

A comparison of methods for the fitting of generalized additive models

Harald Binder Gerhard Tutz 《Statistics and Computing》2008,18(1):87-99

There are several procedures for fitting generalized additive models, i.e. regression models for an exponential family response where the influence of each single covariates is assumed to have unknown, potentially non-linear shape. Simulated data are used to compare a smoothing parameter optimization approach for selection of smoothness and of covariates, a stepwise approach, a mixed model approach, and a procedure based on boosting techniques. In particular it is investigated how the performance of procedures is linked to amount of information, type of response, total number of covariates, number of influential covariates, and extent of non-linearity. Measures for comparison are prediction performance, identification of influential covariates, and smoothness of fitted functions. One result is that the mixed model approach returns sparse fits with frequently over-smoothed functions, while the functions are less smooth for the boosting approach and variable selection is less strict. The other approaches are in between with respect to these measures. The boosting procedure is seen to perform very well when little information is available and/or when a large number of covariates is to be investigated. It is somewhat surprising that in scenarios with low information the fitting of a linear model, even with stepwise variable selection, has not much advantage over the fitting of an additive model when the true underlying structure is linear. In cases with more information the prediction performance of all procedures is very similar. So, in difficult data situations the boosting approach can be recommended, in others the procedures can be chosen conditional on the aim of the analysis. 相似文献

20.

On the impact of contaminations in graphical Gaussian models

Anna Gottard Simona Pacillo 《Statistical Methods and Applications》2007,15(3):343-354

This paper analyzes the impact of some kinds of contaminant on model selection in graphical Gaussian models. We investigate four different kinds of contaminants, in order to consider the effect of gross errors, model deviations, and model misspecification. The aim of the work is to assess against which kinds of contaminant a model selection procedure for graphical Gaussian models has a more robust behavior. The analysis is based on simulated data. The simulation study shows that relatively few contaminated observations in even just one of the variables can have a significant impact on correct model selection, especially when the contaminated variable is a node in a separating set of the graph. 相似文献