首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
We propose a mixture of latent variables model for the model-based clustering, classification, and discriminant analysis of data comprising variables with mixed type. This approach is a generalization of latent variable analysis, and model fitting is carried out within the expectation-maximization framework. Our approach is outlined and a simulation study conducted to illustrate the effect of sample size and noise on the standard errors and the recovery probabilities for the number of groups. Our modelling methodology is then applied to two real data sets and their clustering and classification performance is discussed. We conclude with discussion and suggestions for future work.  相似文献   

2.
Finite mixtures of multivariate skew t (MST) distributions have proven to be useful in modelling heterogeneous data with asymmetric and heavy tail behaviour. Recently, they have been exploited as an effective tool for modelling flow cytometric data. A number of algorithms for the computation of the maximum likelihood (ML) estimates for the model parameters of mixtures of MST distributions have been put forward in recent years. These implementations use various characterizations of the MST distribution, which are similar but not identical. While exact implementation of the expectation-maximization (EM) algorithm can be achieved for ‘restricted’ characterizations of the component skew t-distributions, Monte Carlo (MC) methods have been used to fit the ‘unrestricted’ models. In this paper, we review several recent fitting algorithms for finite mixtures of multivariate skew t-distributions, at the same time clarifying some of the connections between the various existing proposals. In particular, recent results have shown that the EM algorithm can be implemented exactly for faster computation of ML estimates for mixtures with unrestricted MST components. The gain in computational time is effected by noting that the semi-infinite integrals on the E-step of the EM algorithm can be put in the form of moments of the truncated multivariate non-central t-distribution, similar to the restricted case, which subsequently can be expressed in terms of the non-truncated form of the central t-distribution function for which fast algorithms are available. We present comparisons to illustrate the relative performance of the restricted and unrestricted models, and demonstrate the usefulness of the recently proposed methodology for the unrestricted MST mixture, by some applications to three real datasets.  相似文献   

3.
A novel family of mixture models is introduced based on modified t-factor analyzers. Modified factor analyzers were recently introduced within the Gaussian context and our work presents a more flexible and robust alternative. We introduce a family of mixtures of modified t-factor analyzers that uses this generalized version of the factor analysis covariance structure. We apply this family within three paradigms: model-based clustering; model-based classification; and model-based discriminant analysis. In addition, we apply the recently published Gaussian analogue to this family under the model-based classification and discriminant analysis paradigms for the first time. Parameter estimation is carried out within the alternating expectation-conditional maximization framework and the Bayesian information criterion is used for model selection. Two real data sets are used to compare our approach to other popular model-based approaches; in these comparisons, the chosen mixtures of modified t-factor analyzers model performs favourably. We conclude with a summary and suggestions for future work.  相似文献   

4.
5.
6.
Motivated by classification issues that arise in marine studies, we propose a latent-class mixture model for the unsupervised classification of incomplete quadrivariate data with two linear and two circular components. The model integrates bivariate circular densities and bivariate skew normal densities to capture the association between toroidal clusters of bivariate circular observations and planar clusters of bivariate linear observations. Maximum-likelihood estimation of the model is facilitated by an expectation maximization (EM) algorithm that treats unknown class membership and missing values as different sources of incomplete information. The model is exploited on hourly observations of wind speed and direction and wave height and direction to identify a number of sea regimes, which represent specific distributional shapes that the data take under environmental latent conditions.  相似文献   

7.
We design a probability distribution for ordinal data by modeling the process generating data, which is assumed to rely only on order comparisons between categories. Contrariwise, most competitors often either forget the order information or add a non-existent distance information. The data generating process is assumed, from optimality arguments, to be a stochastic binary search algorithm in a sorted table. The resulting distribution is natively governed by two meaningful parameters (position and precision) and has very appealing properties: decrease around the mode, shape tuning from uniformity to a Dirac, identifiability. Moreover, it is easily estimated by an EM algorithm since the path in the stochastic binary search algorithm can be considered as missing values. Using then the classical latent class assumption, the previous univariate ordinal model is straightforwardly extended to model-based clustering for multivariate ordinal data. Parameters of this mixture model are estimated by an AECM algorithm. Both simulated and real data sets illustrate the great potential of this model by its ability to parsimoniously identify particularly relevant clusters which were unsuspected by some traditional competitors.  相似文献   

8.
AStA Advances in Statistical Analysis - Mixtures of t-factor analyzers have been broadly used for model-based density estimation and clustering of high-dimensional data from a heterogeneous...  相似文献   

9.
10.
The purpose of this paper is to examine the multiple group (>2) discrimination problem in which the group sizes are unequal and the variables used in the classification are correlated with skewed distributions. Using statistical simulation based on data from a clinical study, we compare the performances, in terms of misclassification rates, of nine statistical discrimination methods. These methods are linear and quadratic discriminant analysis applied to untransformed data, rank transformed data, and inverse normal scores data, as well as fixed kernel discriminant analysis, variable kernel discriminant analysis, and variable kernel discriminant analysis applied to inverse normal scores data. It is found that the parametric methods with transformed data generally outperform the other methods, and the parametric methods applied to inverse normal scores usually outperform the parametric methods applied to rank transformed data. Although the kernel methods often have very biased estimates, the variable kernel method applied to inverse normal scores data provides considerable improvement in terms of total nonerror rate.  相似文献   

11.
We propose a hybrid two-group classification method that integrates linear discriminant analysis, a polynomial expansion of the basis (or variable space), and a genetic algorithm with multiple crossover operations to select variables from the expanded basis. Using new product launch data from the biochemical industry, we found that the proposed algorithm offers mean percentage decreases in the misclassification error rate of 50%, 56%, 59%, 77%, and 78% in comparison to a support vector machine, artificial neural network, quadratic discriminant analysis, linear discriminant analysis, and logistic regression, respectively. These improvements correspond to annual cost savings of $4.40–$25.73 million.  相似文献   

12.
In this study, a new per-field classification method is proposed for supervised classification of remotely sensed multispectral image data of an agricultural area using Gaussian mixture discriminant analysis (MDA). For the proposed per-field classification method, multivariate Gaussian mixture models constructed for control and test fields can have fixed or different number of components and each component can have different or common covariance matrix structure. The discrimination function and the decision rule of this method are established according to the average Bhattacharyya distance and the minimum values of the average Bhattacharyya distances, respectively. The proposed per-field classification method is analyzed for different structures of a covariance matrix with fixed and different number of components. Also, we classify the remotely sensed multispectral image data using the per-pixel classification method based on Gaussian MDA.  相似文献   

13.
Model-based clustering of Gaussian copulas for mixed data   总被引:1,自引:0,他引:1  
Clustering of mixed data is important yet challenging due to a shortage of conventional distributions for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Indeed copulas, and Gaussian copulas in particular, are powerful tools for easily modeling the distribution of multivariate variables. This model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by considering the intra-component dependencies in a similar way to the Gaussian mixture. Indeed, each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As an interesting by-product, this model generalizes many well-known approaches and provides tools for visualization based on its parameters. The Bayesian inference is achieved with a Metropolis-within-Gibbs sampler. The numerical experiments, on simulated and real data, illustrate the benefits of the proposed model: flexible and meaningful parameterization combined with visualization features.  相似文献   

14.
Abstract

We propose a statistical method for clustering multivariate longitudinal data into homogeneous groups. This method relies on a time-varying extension of the classical K-means algorithm, where a multivariate vector autoregressive model is additionally assumed for modeling the evolution of clusters' centroids over time. Model inference is based on a least-squares method and on a coordinate descent algorithm. To illustrate our work, we consider a longitudinal dataset on human development. Three variables are modeled, namely life expectancy, education and gross domestic product.  相似文献   

15.
In the last decade, much effort has been spent on modelling dependence between sensory variables and chemical–physical ones, especially when observed at different occasions/spaces/times or if collected from several groups (blocks) of variables. In this paper, we propose a nonlinear generalization of multi-block partial least squares with the inclusion of variable interactions. We show the performance of the method on a known data set.  相似文献   

16.
Model-based clustering is a method that clusters data with an assumption of a statistical model structure. In this paper, we propose a novel model-based hierarchical clustering method for a finite statistical mixture model based on the Fisher distribution. The main foci of the proposed method are: (a) provide efficient solution to estimate the parameters of a Fisher mixture model (FMM); (b) generate a hierarchy of FMMs and (c) select the optimal model. To this aim, we develop a Bregman soft clustering method for FMM. Our model estimation strategy exploits Bregman divergence and hierarchical agglomerative clustering. Whereas, our model selection strategy comprises a parsimony-based approach and an evaluation graph-based approach. We empirically validate our proposed method by applying it on simulated data. Next, we apply the method on real data to perform depth image analysis. We demonstrate that the proposed clustering method can be used as a potential tool for unsupervised depth image analysis.  相似文献   

17.
We propose models to analyze animal growth data with the aim of estimating and predicting quantities of biological and economical interest such as the maturing rate and asymptotic weight. It is also studied the effect of environmental factors of relevant influence in the growth process. The models considered in this paper are based on an extension and specialization of the dynamic hierarchical model (Gamerman & Migon, 1993) to a non–linear growth curve setting, where some of the growth curve parameters are considered exchangeable among the units. The inference for these models are approximate conjugate analysis based on Taylor series expansions and linear Bayes procedures  相似文献   

18.
Consider classifying an n × I observation vector as coming from one of two multivariate normal distributions which differ both in mean vectors and covariance matrices. A class of dis-crimination rules based upon n independent univariate discrim-inate functions is developed yielding exact misclassification probabilities when the population parameters are known. An efficient search of this class to select the procedure with minimum expected misclassification is made by employing an algorithm of the implicit enumeration type used in integer programming. The procedure is applied to the classification of male twins as either monozygotic or dizygotic.  相似文献   

19.
In the present paper we examine finite mixtures of multivariate Poisson distributions as an alternative class of models for multivariate count data. The proposed models allow for both overdispersion in the marginal distributions and negative correlation, while they are computationally tractable using standard ideas from finite mixture modelling. An EM type algorithm for maximum likelihood (ML) estimation of the parameters is developed. The identifiability of this class of mixtures is proved. Properties of ML estimators are derived. A real data application concerning model based clustering for multivariate count data related to different types of crime is presented to illustrate the practical potential of the proposed class of models.  相似文献   

20.
An image that is mapped into a bit stream suitable for communication over or storage in a digital medium is said to have been compressed. Using tree-structured vector quantizers (TSVQs) is an approach to image compression in which clustering algorithms are combined with ideas from tree-structured classification to provide code books that can be searched quickly and simply. The overall goal is to optimize the quality of the compressed image subject to a constraint on the communication or storage capacity, i.e. on the allowed bit rate. General goals of image compression and vector quantization are summarized in this paper. There is discussion of methods for code book design, particularly the generalized Lloyd algorithm for clustering, and methods for splitting and pruning that have been extended from the design of classification trees to TSVQs. The resulting codes, called pruned TSVQs, are of variable rate, and yield lower distortion than fixed-rate, full-search vector quantizers for a given average bit rate. They have simple encoders and a natural successive approximation (progressive) property. Applications of pruned TSVQs are discussed, particularly compressing computerized tomography images. In this work, the key issue is not merely the subjective attractiveness of the compressed image but rather whether the diagnostic accuracy is adversely aflected by compression. In recent work, TSVQs have been combined with other types of image processing, including segmentation and enhancement. The relationship between vector quantizer performance and the size of the training sequence used to design the code and other asymptotic properties of the codes are discussed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号