Similar articles
 20 similar articles found
1.
Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.

2.
Model-based classification using latent Gaussian mixture models   (cited by: 1; self-citations: 0; citations by others: 1)
A novel model-based classification technique is introduced based on parsimonious Gaussian mixture models (PGMMs). PGMMs, which were introduced recently as a model-based clustering technique, arise from a generalization of the mixtures of factor analyzers model and are based on a latent Gaussian mixture model. In this paper, this mixture modelling structure is used for model-based classification and the particular area of application is food authenticity. Model-based classification is performed by jointly modelling data with known and unknown group memberships within a likelihood framework and then estimating parameters, including the unknown group memberships, within an alternating expectation-conditional maximization framework. Model selection is carried out using the Bayesian information criterion and the quality of the maximum a posteriori classifications is summarized using the misclassification rate and the adjusted Rand index. This new model-based classification technique gives excellent classification performance when applied to real food authenticity data on the chemical properties of olive oils from nine areas of Italy.
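The two summary measures named in this abstract can be computed directly from a pair of label vectors. The following is a minimal sketch, assuming the usual Hubert-Arabie form of the adjusted Rand index; the function names are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (Hubert-Arabie form) between two partitions."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two observations are required")

    # Contingency-table cell counts plus row and column totals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # largest attainable agreement
    if maximum == expected:
        # Degenerate case, e.g. both partitions place everything in one group.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def misclassification_rate(labels_true, labels_pred):
    """Fraction of observations whose predicted label differs from the truth."""
    wrong = sum(t != p for t, p in zip(labels_true, labels_pred))
    return wrong / len(labels_true)
```

The adjusted Rand index is invariant to relabelling of the groups, whereas the misclassification rate assumes the predicted labels have already been matched to the true ones.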

3.
A novel family of mixture models is introduced based on modified t-factor analyzers. Modified factor analyzers were recently introduced within the Gaussian context and our work presents a more flexible and robust alternative. We introduce a family of mixtures of modified t-factor analyzers that uses this generalized version of the factor analysis covariance structure. We apply this family within three paradigms: model-based clustering; model-based classification; and model-based discriminant analysis. In addition, we apply the recently published Gaussian analogue to this family under the model-based classification and discriminant analysis paradigms for the first time. Parameter estimation is carried out within the alternating expectation-conditional maximization framework and the Bayesian information criterion is used for model selection. Two real data sets are used to compare our approach to other popular model-based approaches; in these comparisons, the chosen mixtures of modified t-factor analyzers model performs favourably. We conclude with a summary and suggestions for future work.

4.
The last decade has seen an explosion of work on the use of mixture models for clustering. The use of the Gaussian mixture model has been common practice, with constraints sometimes imposed upon the component covariance matrices to give families of mixture models. Similar approaches have also been applied, albeit with less fecundity, to classification and discriminant analysis. In this paper, we begin with an introduction to model-based clustering and a succinct account of the state of the art. We then put forth a novel family of mixture models wherein each component is modeled using a multivariate t-distribution with an eigen-decomposed covariance structure. This family, which is largely a t-analogue of the well-known MCLUST family, is known as the tEIGEN family. The efficacy of this family for clustering, classification, and discriminant analysis is illustrated with both real and simulated data. The performance of this family is compared to its Gaussian counterpart on three real data sets.

5.
Mixture model-based clustering is widely used in many applications. In certain real-time applications the rapid increase of data size with time makes classical clustering algorithms too slow. An online clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers which uses data from acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known online k-means algorithm, able to handle non-spherical clusters. Using synthetic and real data sets, the proposed algorithm is compared with the batch CEM algorithm and the online EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when clusters are relatively well separated, but the online algorithms become faster as the number of available observations increases.
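For orientation, the well-known online k-means algorithm that this abstract says is generalized can be sketched as follows. This is only the spherical special case with a 1/count step size, which is an assumption of this sketch; it is not the stochastic gradient CEM algorithm of the paper, which would also have to update the other mixture parameters. The name `online_kmeans` is illustrative.

```python
def online_kmeans(stream, initial_centres):
    """Process observations one at a time, moving only the nearest centre.

    With step size 1/count, each centre is the running mean of the points
    assigned to it so far (the initial centre is overwritten by the first
    point it receives).
    """
    centres = [list(c) for c in initial_centres]
    counts = [0] * len(centres)
    for x in stream:
        # Classification step: pick the centre with the smallest squared distance.
        k = min(
            range(len(centres)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centres[j])),
        )
        # Stochastic update of the winning centre only.
        counts[k] += 1
        step = 1.0 / counts[k]
        centres[k] = [c + step * (a - c) for a, c in zip(x, centres[k])]
    return centres, counts
```

Because each observation is visited once and then discarded, the cost per observation does not grow with the amount of data already seen, which is the property that makes such schemes attractive in real-time settings.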

6.
We propose a mixture of latent variables model for the model-based clustering, classification, and discriminant analysis of data comprising variables of mixed type. This approach is a generalization of latent variable analysis, and model fitting is carried out within the expectation-maximization framework. Our approach is outlined and a simulation study is conducted to illustrate the effect of sample size and noise on the standard errors and the recovery probabilities for the number of groups. Our modelling methodology is then applied to two real data sets, and the resulting clustering and classification performance is discussed. We conclude with discussion and suggestions for future work.

7.
We examined the impact of different methods for replacing missing data in discriminant analyses conducted on randomly generated samples from multivariate normal and non-normal distributions. The probabilities of correct classification were obtained for these discriminant analyses before and after randomly deleting data as well as after deleted data were replaced using: (1) variable means, (2) principal component projections, and (3) the EM algorithm. Populations compared were: (1) multivariate normal with covariance matrices Σ1 = Σ2, (2) multivariate normal with Σ1 ≠ Σ2, and (3) multivariate non-normal with Σ1 = Σ2. Differences in the probabilities of correct classification were most evident for populations with small Mahalanobis distances or high proportions of missing data. The three replacement methods performed similarly but all were better than non-replacement.
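The simplest of the three replacement methods, replacement by variable means, can be sketched as below. The data layout (a list of rows with `None` marking a missing value) and the function name are assumptions made for illustration; the abstract does not describe how the study implemented it.

```python
def impute_variable_means(rows):
    """Replace each missing entry (None) by the mean of the observed
    values of the same variable (column)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        means.append(sum(observed) / len(observed))
    # Build a completed copy; the input rows are left unchanged.
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]
```

Mean replacement leaves each variable's mean unchanged but shrinks its variance and covariances, which is one reason to compare it against projection-based and EM-based replacement.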

8.
The problem of constructing classification methods based on both labeled and unlabeled data sets is considered for analyzing data with complex structures. We introduce a semi-supervised logistic discriminant model with Gaussian basis expansions. Unknown parameters included in the logistic model are estimated by a regularization method combined with the EM algorithm. For selecting the adjusted parameters, we derive a model selection criterion from a Bayesian viewpoint. Numerical studies are conducted to investigate the effectiveness of our proposed modeling procedures.

9.
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher's discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher's discriminant problem as a biconvex problem. We evaluate the performance of the resulting methods in a simulation study and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
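The singularity claim can be checked from the standard within-class covariance estimate. The notation below (class index sets C_k and class means \hat{\mu}_k) is an assumption of this note and is not taken from the abstract.

```latex
\hat{\Sigma}_w = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k}
  (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{\top},
\qquad
\operatorname{rank}(\hat{\Sigma}_w) \le n - K < p
\quad \text{whenever } p > n - K .
```

Within each class the centred observations sum to zero, so class k contributes at most |C_k| - 1 linearly independent directions; summing over the K classes gives the bound n - K. A p × p matrix of rank below p has no inverse, so the usual discriminant rule, which requires that inverse, cannot be formed.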

10.
The purpose of this paper is to examine the multiple group (>2) discrimination problem in which the group sizes are unequal and the variables used in the classification are correlated with skewed distributions. Using statistical simulation based on data from a clinical study, we compare the performances, in terms of misclassification rates, of nine statistical discrimination methods. These methods are linear and quadratic discriminant analysis applied to untransformed data, rank transformed data, and inverse normal scores data, as well as fixed kernel discriminant analysis, variable kernel discriminant analysis, and variable kernel discriminant analysis applied to inverse normal scores data. It is found that the parametric methods with transformed data generally outperform the other methods, and the parametric methods applied to inverse normal scores usually outperform the parametric methods applied to rank transformed data. Although the kernel methods often have very biased estimates, the variable kernel method applied to inverse normal scores data provides considerable improvement in terms of total nonerror rate.

11.
In this study, a new per-field classification method is proposed for supervised classification of remotely sensed multispectral image data of an agricultural area using Gaussian mixture discriminant analysis (MDA). For the proposed per-field classification method, the multivariate Gaussian mixture models constructed for control and test fields can have a fixed or a varying number of components, and the components can have different covariance matrix structures or a common one. The discrimination function and the decision rule of this method are established according to the average Bhattacharyya distance and the minimum values of the average Bhattacharyya distances, respectively. The proposed per-field classification method is analyzed for different covariance matrix structures with fixed and varying numbers of components. Also, we classify the remotely sensed multispectral image data using the per-pixel classification method based on Gaussian MDA.

12.
Mixtures of factor analyzers is a useful model-based clustering method which can avoid the curse of dimensionality in high-dimensional clustering. However, this approach is sensitive to both diverse non-normalities of marginal variables and outliers, which are commonly observed in multivariate experiments. We propose mixtures of Gaussian copula factor analyzers (MGCFA) for high-dimensional clustering. This model has two advantages: (1) it allows different marginal distributions, which makes the mixture model more flexible to fit, and (2) it can avoid the curse of dimensionality by embedding the factor-analytic structure in the component-correlation matrices of the mixture distribution. An EM algorithm is developed for the fitting of MGCFA. The proposed method is free of the curse of dimensionality and allows any parametric marginal distribution which fits best to the data. It is applied to both synthetic data and a microarray gene expression data set for clustering, where it performs better than several existing methods.

13.
Model-based clustering using copulas with applications   (cited by: 1; self-citations: 0; citations by others: 1)
The majority of model-based clustering techniques is based on multivariate normal models and their variants. In this paper copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.

14.
Parameters of a finite mixture model are often estimated by the expectation–maximization (EM) algorithm where the observed data log-likelihood function is maximized. This paper proposes an alternative approach for fitting finite mixture models. Our method, called the iterative Monte Carlo classification (IMCC), is also an iterative fitting procedure. Within each iteration, it first estimates the membership probabilities for each data point, namely the conditional probability of a data point belonging to a particular mixing component given the observed value of that data point. It then classifies each data point into a component distribution using the estimated conditional probabilities and the Monte Carlo method. It finally updates the parameters of each component distribution based on the classified data. Simulation studies were conducted to compare IMCC with some other algorithms for fitting normal mixture and t mixture densities.
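The three steps described in this abstract can be sketched for a univariate normal mixture as follows. This is a minimal illustration written from the abstract alone, not the authors' implementation: the function name `imcc_sketch`, the fixed number of iterations, the handling of nearly empty components, and the lower bound on the standard deviation are all assumptions.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def imcc_sketch(data, means, sds, weights, n_iter=50, seed=0):
    """Iterate: membership probabilities -> random classification -> update."""
    rng = random.Random(seed)
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for x in data:
            # Step 1: conditional probability of each component given x.
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            probs = [d / total for d in dens] if total > 0.0 else [1.0 / k] * k
            # Step 2: Monte Carlo classification, i.e. draw a label at random
            # according to those probabilities.
            label = rng.choices(range(k), weights=probs)[0]
            groups[label].append(x)
        # Step 3: re-estimate each component from the points classified into it.
        for j, group in enumerate(groups):
            weights[j] = len(group) / len(data)
            if len(group) < 2:
                continue  # assumption: keep the previous mean and sd
            means[j] = sum(group) / len(group)
            var = sum((x - means[j]) ** 2 for x in group) / len(group)
            sds[j] = max(math.sqrt(var), 1e-6)  # assumption: guard against sd = 0
    return means, sds, weights
```

Unlike EM, which weights every observation by its membership probabilities, this scheme assigns each observation to exactly one component per iteration, so the parameter estimates fluctuate from iteration to iteration instead of settling on a fixed point.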

15.
The problem of two-group classification has implications in a number of fields, such as medicine, finance, and economics. This study aims to compare the methods of two-group classification. The minimum sum of deviations and linear programming model, linear discriminant analysis, quadratic discriminant analysis and logistic regression, multivariate analysis of variance (MANOVA) test-based classification and the unpooled T-square test-based classification methods, support vector machines and k-nearest neighbor methods, and a combined classification method are compared for data structures having fat tails and/or skewness. The comparison is carried out using a simulation procedure designed for various stable distribution structures and sample sizes.

16.
Model-based clustering methods for continuous data are well established and commonly used in a wide range of applications. However, model-based clustering methods for categorical data are less standard. Latent class analysis is a commonly used method for model-based clustering of binary data and/or categorical data, but due to an assumed local independence structure there may not be a correspondence between the estimated latent classes and groups in the population of interest. The mixture of latent trait analyzers model extends latent class analysis by assuming a model for the categorical response variables that depends on both a categorical latent class and a continuous latent trait variable; the discrete latent class accommodates group structure and the continuous latent trait accommodates dependence within these groups. Fitting the mixture of latent trait analyzers model is potentially difficult because the likelihood function involves an integral that cannot be evaluated analytically. We develop a variational approach for fitting the mixture of latent trait models and this provides an efficient model fitting strategy. The mixture of latent trait analyzers model is demonstrated on the analysis of data from the National Long Term Care Survey (NLTCS) and voting in the U.S. Congress. The model is shown to yield intuitive clustering results and it gives a much better fit than either latent class analysis or latent trait analysis alone.

17.
Some bootstrap and boosting methods for problems related to classification are introduced in this article. The first method chooses better boosting weights by using a bootstrap search algorithm. The second method is a good way to define a classification frontier. A new formulation for boosting in linear discriminant analysis is given. Since in this new formulation the uncertainty is represented by the weighted covariance matrix, it is more appropriate from the conceptual point of view. Simulation results show that the proposed methods perform well in data analysis.

18.
The development of models and methods for cure rate estimation has recently burgeoned into an important subfield of survival analysis. Much of the literature focuses on the standard mixture model. Recently, process-based models have been suggested. We focus on several models based on first passage times for Wiener processes. Whitmore and others have studied these models in a variety of contexts. Lee and Whitmore (Stat Sci 21(4):501–513, 2006) give a comprehensive review of a variety of first hitting time models and briefly discuss their potential as cure rate models. In this paper, we study the Wiener process with negative drift as a possible cure rate model but the resulting defective inverse Gaussian model is found to provide a poor fit in some cases. Several possible modifications are then suggested, which improve the defective inverse Gaussian. These modifications include: the inverse Gaussian cure rate mixture model; a mixture of two inverse Gaussian models; incorporation of heterogeneity in the drift parameter; and the addition of a second absorbing barrier to the Wiener process, representing an immunity threshold. This class of process-based models is a useful alternative to the standard model and provides an improved fit compared to the standard model when applied to many of the datasets that we have studied. Implementation of this class of models is facilitated using expectation-maximization (EM) algorithms and variants thereof, including the gradient EM algorithm. Parameter estimates for each of these EM algorithms are given and the proposed models are applied to both real and simulated data, where they perform well.

19.
We consider the study of censored survival times in the situation where the available data consist of both eligible and ineligible subjects, and information distinguishing the two groups is sometimes missing. A complete-case analysis in this context would use only subjects known to be eligible, resulting in inefficient and potentially biased estimators. We propose a two-step procedure which resembles the EM algorithm but is computationally much faster. In the first step, one estimates the conditional expectation of the missing eligibility indicators given the observed data using a logistic regression based on the complete cases (i.e., subjects with non-missing eligibility indicator). In the second step, maximum likelihood estimators are obtained from a weighted Cox proportional hazards model, with the weights being either observed eligibility indicators or estimated conditional expectations thereof. Under ignorable missingness, the estimators from the second step are proven to be consistent and asymptotically normal, with explicit variance estimators. We demonstrate through simulation that the proposed methods perform well for moderate sized samples and are robust in the presence of eligibility indicators that are missing not at random. The proposed procedure is more efficient and more robust than the complete case analysis and, unlike the EM algorithm, does not require time-consuming iteration. Although the proposed methods are applicable generally, they would be most useful for large data sets (e.g., administrative data), for which the computational savings outweigh the price one has to pay for making various approximations in avoiding iteration. We apply the proposed methods to national kidney transplant registry data.

20.
Population size estimation with discrete or nonparametric mixture models is considered, and reliable ways of constructing the nonparametric mixture model estimator are reviewed and set into perspective. Construction of the maximum likelihood estimator of the mixing distribution is done for any number of components up to the global nonparametric maximum likelihood bound using the EM algorithm. In addition, the estimators of Chao and Zelterman are considered with some generalisations of Zelterman’s estimator. All computations are done with CAMCR, a special software package developed for population size estimation with mixture models. Several examples and data sets are discussed and the estimators illustrated. Problems using the mixture model-based estimators are highlighted.
