Similar literature
 20 similar documents found (search time: 31 ms)
1.
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster follows a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components to model outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions for mixture modeling: multivariate t distributions with the Box-Cox transformation. This class replaces the normal with the heavier-tailed t distribution and introduces skewness via the Box-Cox transformation, providing a unified framework that simultaneously handles outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches, including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
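The transformation step at the heart of this approach can be illustrated with scipy's Box-Cox utility. A minimal sketch on an invented right-skewed sample (the paper's full EM with joint transformation selection is not reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.6, size=500)   # invented right-skewed sample

# scipy chooses lambda by maximizing the Box-Cox log-likelihood
y, lam = stats.boxcox(x)

# the transformed sample is far less skewed than the original
print(round(float(stats.skew(x)), 2), round(float(stats.skew(y)), 2))
```

With a maximum-likelihood lambda near zero the transform is close to a log, which is why Box-Cox copes well with lognormal-like skewness.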

2.
Multivariate mixture regression models can be used to investigate the relationships between two or more response variables and a set of predictor variables while accounting for unobserved population heterogeneity. It is common to take multivariate normal distributions as mixing components, but this choice is sensitive to heavy-tailed errors and outliers. Although normal mixture models can approximate any distribution in principle, the number of components needed to account for heavy-tailed distributions can be very large. Mixture regression models based on multivariate t distributions are a robust alternative. Missing data are inevitable in many situations, and parameter estimates can be biased if the missing values are not handled properly. In this paper, we propose a multivariate t mixture regression model with missing information to model heterogeneity in the regression function in the presence of outliers and missing values. Along with robust parameter estimation, our proposed method can be used for (i) visualization of the partial correlation between response variables across latent classes and heterogeneous regressions, and (ii) outlier detection and robust clustering even in the presence of missing values. We also propose a multivariate t mixture regression model using MM-estimation with missing information that is robust to high-leverage outliers. The proposed methodologies are illustrated through simulation studies and real data analysis.

3.
The majority of the existing literature on model-based clustering deals with symmetric components. In some cases, especially when dealing with skewed subpopulations, the estimate of the number of groups can be misleading; if symmetric components are assumed, more than one component is needed to describe an asymmetric group. Existing mixture models, based on multivariate normal and multivariate t distributions, fit symmetric distributions, i.e. symmetric clusters. In the present paper, we propose the use of finite mixtures of the normal inverse Gaussian distribution (and its multivariate extensions). Such finite mixture models start from a density that allows for skewness and fat tails, generalize the existing models, are tractable and have desirable properties. We examine both the univariate case, to gain insight, and the multivariate case, which is more useful in real applications. EM-type algorithms are described for fitting the models. Real data examples demonstrate the potential of the new model in comparison with existing ones.
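The key property of the component density can be checked directly with scipy's implementation of the normal inverse Gaussian. A minimal sketch; the parameter values are illustrative, not fitted to any data set:

```python
from scipy.stats import norminvgauss

# scipy's NIG parametrisation: tail-heaviness parameter a and skewness
# parameter b, with |b| < a; these values are purely illustrative
nig = norminvgauss(a=1.5, b=1.0)

skewness, excess_kurtosis = nig.stats(moments="sk")
# unlike a normal component, the NIG has nonzero skew and fat tails
print(float(skewness), float(excess_kurtosis))
```

A single NIG component can therefore absorb an asymmetric, heavy-tailed cluster that would require several normal components.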

4.
Robust mixture modelling using the t distribution
Normal mixture models are increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a data set containing a group or groups of observations with longer-than-normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach that models the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described, and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.
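A minimal univariate sketch of the ECM idea: alongside the usual responsibilities, the E-step produces latent precision weights u that automatically downweight atypical points. The two-component simulated data, fixed degrees of freedom nu, and quantile-based start are assumptions for illustration; the paper's algorithm also estimates nu and covers the multivariate case.

```python
import numpy as np
from scipy.stats import t as student_t

def em_t_mixture(x, nu=4.0, n_iter=200):
    """ECM sketch for a two-component univariate t mixture with fixed df nu."""
    mu = np.quantile(x, [0.25, 0.75])      # crude but deterministic start
    sigma = np.full(2, x.std())
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities r and latent precision weights u
        dens = np.stack([pi[k] * student_t.pdf(x, nu, loc=mu[k], scale=sigma[k])
                         for k in range(2)])
        r = dens / dens.sum(axis=0)
        for k in range(2):
            u = (nu + 1.0) / (nu + ((x - mu[k]) / sigma[k]) ** 2)  # downweights outliers
            w = r[k] * u
            mu[k] = np.sum(w * x) / np.sum(w)
            sigma[k] = np.sqrt(np.sum(w * (x - mu[k]) ** 2) / np.sum(r[k]))
        pi = r.mean(axis=1)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.standard_t(4, size=300),
                    rng.standard_t(4, size=300) + 8.0])
pi, mu, sigma = em_t_mixture(x)
print(np.sort(mu))
```

The weights u shrink toward zero for points far from a component's center, which is what makes the t mixture robust to background noise.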

5.
Izenman and Sommer (1988) used a non-parametric kernel density estimation technique to fit a seven-component model to the paper thickness of the 1872 Hidalgo stamp issue of Mexico. They observed an apparent conflict when fitting a normal mixture model with three components with unequal variances. This conflict is examined further by investigating the most appropriate number of components when fitting a normal mixture of components with equal variances.

6.
This article presents the "centered" method for establishing cell boundaries in the χ² goodness-of-fit test, which, when applied to common stock returns, significantly reduces the high bias of the test statistic associated with the traditional Mann–Wald equiprobable approach. A modified null hypothesis is proposed to incorporate explicitly the usually implicit assumption that the observed discrete returns are "approximated" by the hypothesized continuous density. Simulation results indicate extremely biased χ² values resulting from the traditional approach, particularly for low-priced and low-volatility stocks. Daily stock returns for 114 firms are tested to determine whether they are approximated by a normal or one of several normal mixture densities. Results indicate a significantly higher degree of fit than that reported elsewhere to date.
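For context, the traditional Mann–Wald-style equiprobable-cell test that the article argues is biased can be sketched roughly as follows (the cell count k = 20 and the simulated data are invented, not the article's stock returns):

```python
import numpy as np
from scipy import stats

def equiprobable_chi2(x, k=20):
    """Chi-squared goodness-of-fit test for normality with k equiprobable
    cells; cell edges sit at quantiles of the fitted normal."""
    mu, sd = x.mean(), x.std(ddof=1)
    interior = stats.norm.ppf(np.arange(1, k) / k, loc=mu, scale=sd)
    observed = np.bincount(np.searchsorted(interior, x), minlength=k)
    expected = len(x) / k
    chi2 = np.sum((observed - expected) ** 2 / expected)
    dof = k - 1 - 2                       # two parameters fitted from the data
    return chi2, stats.chi2.sf(chi2, dof)

rng = np.random.default_rng(0)
chi2_norm, p_norm = equiprobable_chi2(rng.normal(size=1000))      # should not reject
chi2_exp, p_exp = equiprobable_chi2(rng.exponential(size=1000))   # should reject
print(p_norm, p_exp)
```

The article's point is that on coarsely discretized returns this construction inflates the statistic; the "centered" boundaries are its remedy.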

7.
Mixture models are used in a large number of applications, yet there remain difficulties with maximum likelihood estimation. For instance, the likelihood surface for finite normal mixtures often has a large number of local maximizers, some of which do not give a good representation of the underlying features of the data. In this paper we present diagnostics that can be used to check the quality of an estimated mixture distribution. Particular attention is given to normal mixture models since they frequently arise in practice. We apply the diagnostic tools to finite normal mixture problems and to the nonparametric setting, where the difficult problem of determining a scale parameter for a normal mixture density estimate is considered. A large-sample justification for the proposed methodology is provided, and we illustrate its implementation through several examples.

8.
We study the association between bone mineral density (BMD) and body mass index (BMI) when contingency tables are constructed from several U.S. counties, where BMD has three levels (normal, osteopenia and osteoporosis) and BMI has four levels (underweight, normal, overweight and obese). We use the Bayes factor (posterior odds divided by prior odds, or equivalently the ratio of marginal likelihoods) to construct the new test. Like the chi-squared test and Fisher's exact test, we have a direct Bayes test, which is a standard test using data from each county. Our main contribution is that, for each county, techniques of small-area estimation are used to borrow strength across counties, and a pooled test of independence of BMD and BMI is obtained using a hierarchical Bayesian model. Our pooled Bayes test is computed by performing a Monte Carlo integration using random samples rather than Gibbs samples. When the degree of evidence against independence is studied, we see important differences among the pooled Bayes test, the direct Bayes test and the Cressie-Read test that allows for some degree of sparseness. As expected, we also found that the direct Bayes test is sensitive to the prior specifications but the pooled Bayes test is not. Moreover, the pooled Bayes test has competitive power properties, and it is superior when the cell counts are small to moderate.

9.
The zero-inflated Poisson regression model is a special case of finite mixture models that is useful for count data containing many zeros. Typically, maximum likelihood (ML) estimation is used for fitting such models. However, it is well known that the ML estimator is highly sensitive to the presence of outliers and can become unstable when mixture components are poorly separated. In this paper, we propose an alternative robust estimation approach, robust expectation-solution (RES) estimation. We compare the RES approach with an existing robust approach, minimum Hellinger distance (MHD) estimation. Simulation results indicate that both methods improve on ML when outliers are present and/or when the mixture components are poorly separated. However, the RES approach is more efficient in all the scenarios we considered. In addition, the RES method is shown to yield consistent and asymptotically normal estimators and, in contrast to MHD, can be applied quite generally.
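As a baseline for what RES and MHD improve on, the ML fit of a zero-inflated Poisson (intercept-only, no covariates) can be sketched directly from its likelihood; the simulated data and the logit/log parametrization are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def zip_nll(params, y):
    """Negative log-likelihood of the zero-inflated Poisson."""
    pi = expit(params[0])          # zero-inflation probability
    lam = np.exp(params[1])        # Poisson mean
    logp_pois = -lam + y * np.log(lam) - gammaln(y + 1)
    ll = np.where(y == 0,
                  np.log(pi + (1 - pi) * np.exp(-lam)),   # structural or Poisson zero
                  np.log1p(-pi) + logp_pois)
    return -ll.sum()

rng = np.random.default_rng(0)
n, true_pi, true_lam = 2000, 0.3, 4.0
y = rng.poisson(true_lam, n) * (rng.random(n) > true_pi)   # inject extra zeros

res = minimize(zip_nll, x0=np.array([0.0, 0.0]), args=(y,), method="Nelder-Mead")
pi_hat, lam_hat = expit(res.x[0]), np.exp(res.x[1])
print(pi_hat, lam_hat)
```

The zero cell mixes structural zeros with Poisson zeros, which is exactly the finite-mixture structure the abstract refers to.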

10.
We consider the test of the null hypothesis that the largest mean in a mixture of an unknown number of normal components is less than or equal to a given threshold. This test is motivated by the problem of assessing whether the Soviet Union has been operating in compliance with the Nuclear Test Ban Treaty. In our analysis, the number of normal components is determined using Akaike's Information Criterion, while the hypothesis test itself is based on asymptotic results given by Behboodian for a mixture of two normal components. A bootstrap approach is also considered for estimating the standard error of the largest estimated mean. The performance of the tests is examined through simulation.

11.
Two commonly used approximations for the inverse distribution function of the normal distribution are Schmeiser's and Shore's. Both are based on a power transformation of either the cumulative distribution function (CDF) or a simple function of it. In this note we demonstrate that if these approximations are presented in the form of the classical one-parameter Box-Cox transformation, and the exponent of the transformation is expressed as a simple function of the CDF, then the accuracy of both approximations may be considerably enhanced without losing much algebraic simplicity. Since both approximations are special cases of more general four-parameter systems of distributions, the results presented here indicate that the accuracy of the latter, when used to represent non-normal density functions, may also be considerably enhanced.
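To make the idea concrete, here is one classical power-type approximation of the standard normal quantile in the same spirit, z ≈ (p^0.135 − (1 − p)^0.135)/0.1975. This is a textbook member of the family, not Schmeiser's or Shore's exact form:

```python
import numpy as np
from scipy.stats import norm

def approx_probit(p):
    """Power-type approximation to the standard normal quantile."""
    p = np.asarray(p, dtype=float)
    return (p ** 0.135 - (1 - p) ** 0.135) / 0.1975

# compare against scipy's exact quantile on a grid away from the extreme tails
p = np.linspace(0.01, 0.99, 99)
err = np.abs(approx_probit(p) - norm.ppf(p))
print(err.max())
```

On this range the maximum absolute error stays near 0.01, which illustrates how a simple power (Box-Cox-like) transformation of the CDF can track the probit closely.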

12.
This article deals with semisupervised learning based on the naive Bayes assumption. A univariate Gaussian mixture density is used for continuous input variables, whereas a histogram-type density is adopted for discrete input variables. The EM algorithm is used to compute maximum likelihood estimates of the model parameters when the number of mixing components for each continuous input variable is fixed. We carry out model selection, choosing a parsimonious model among various fitted models based on an information criterion. A common density method is proposed for the selection of significant input variables. Simulated and real datasets are used to illustrate the performance of the proposed method.

13.
Maclean et al. (1976) applied a specific Box-Cox transformation to test for mixtures of distributions against a single distribution. Their null hypothesis is that a sample of n observations is from a normal distribution with unknown mean and variance after a restricted Box-Cox transformation. The alternative is that the sample is from a mixture of two normal distributions, each with unknown mean and unknown, but equal, variance after another restricted Box-Cox transformation. We developed a computer program that calculates the maximum likelihood estimates (MLEs) and the likelihood ratio test (LRT) statistic for this problem. Our algorithm for the calculation of the MLEs uses multiple starting points to protect against convergence to a local rather than global maximum. We then simulated the distribution of the LRT for samples drawn from a normal distribution and five Box-Cox transformations of a normal distribution. The null distribution appeared to be the same for all Box-Cox transformations studied, and for samples of size 25 or more it appeared to follow a chi-square distribution whose degrees-of-freedom parameter decreases monotonically with sample size, converging to a chi-square distribution with 2.5 degrees of freedom. We estimated the critical values for the 0.10, 0.05, and 0.01 levels of significance.
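A rough modern analogue of the simulation ingredient, using sklearn's EM fits to form the LRT statistic for one versus two normal components; unlike the paper, there is no Box-Cox step, no equal-variance restriction, and multiple starts are handled only through `n_init`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt_1_vs_2(x, n_init=5):
    """2 * (loglik of best 2-component fit - loglik of 1-component fit)."""
    ll = []
    for k in (1, 2):
        gm = GaussianMixture(n_components=k, n_init=n_init,
                             random_state=0).fit(x)
        ll.append(gm.score(x) * len(x))   # score() is the mean loglik per point
    return 2.0 * (ll[1] - ll[0])

rng = np.random.default_rng(0)
single = rng.normal(size=(200, 1))                 # null: one normal
mixed = np.vstack([rng.normal(size=(100, 1)),
                   rng.normal(loc=5.0, size=(100, 1))])   # clear two-component data
lrt_single = lrt_1_vs_2(single)
lrt_mixed = lrt_1_vs_2(mixed)
print(lrt_single, lrt_mixed)
```

Repeating the `single` draw many times would give an empirical null distribution of the LRT, the quantity whose chi-square behaviour the paper characterizes.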

14.
This paper derives the analytical expression for the integrated squared density partial derivative (ISDPD) in a multivariate normal mixture model. The expression is derived for arbitrary dimensions with partial derivative orders up to four. Although the value of the ISDPD can be obtained by common numerical integration in mathematical software (such as Maple, Mathematica, or Matlab), that route suffers from prohibitive computation time when the dimension or the number of mixture components of the model is large. Numerical comparison indicates that the proposed analytical expression is far faster than numerical integration performed by Maple. With this expression, the ISDPD can be calculated quickly without any numerical integration.

15.
Within the context of mixture modeling, the normal distribution is typically used as the component distribution. However, if a cluster is skewed or heavy tailed, then the normal distribution will fit poorly and many components may be needed to model a single cluster. In this paper, we present an attempt to solve this problem. We define a cluster, in the absence of further information, to be a group of data which can be modeled by a unimodal density function. Hence, our intention is to use a family of univariate distribution functions, to replace the normal, for which the only constraint is unimodality. With this aim, we devise a new family of nonparametric unimodal distributions, which has large support over the space of univariate unimodal distributions. The difficult aspect of the Bayesian model is to construct a suitable MCMC algorithm to sample from the correct posterior distribution. The key is the introduction of strategic latent variables and the use of the product-space view of reversible jump methodology.

16.
Mixture distributions have recently become more and more popular in many scientific fields. Statistical computation and analysis of mixture models, however, are extremely complex due to the large number of parameters involved. Both EM algorithms for likelihood inference and MCMC procedures for Bayesian analysis face various difficulties in dealing with mixtures with an unknown number of components. In this paper, we propose a direct sampling approach to the computation of Bayesian finite mixture models with a varying number of components. This approach requires only knowledge of the density function up to a multiplicative constant. It is easy to implement, numerically efficient and very practical in real applications. A simulation study shows that it performs quite satisfactorily on relatively high-dimensional distributions. A well-known genetic data set is used to demonstrate the simplicity of this method and its power for the computation of high-dimensional Bayesian mixture models.

17.
Sequences of independent random variables are observed and on the basis of these observations future values of the process are forecast. The Bayesian predictive density of k future observations for normal, exponential, and binomial sequences which change exactly once are analyzed for several cases. It is seen that the Bayesian predictive densities are mixtures of standard probability distributions. For example, with normal sequences the Bayesian predictive density is a mixture of either normal or t-distributions, depending on whether or not the common variance is known. The mixing probabilities are the same as those occurring in the corresponding posterior distribution of the mean(s) of the sequence. The predictive mass function of the number of future successes that will occur in a changing Bernoulli sequence is computed and point and interval predictors are illustrated.

18.
In this work, we develop a modeling and estimation approach for the analysis of cross-sectional clustered data with multimodal conditional distributions, where the main interest is in analysis of subpopulations. We propose to model such data hierarchically, with conditional distributions viewed as finite mixtures of normal components. With a large number of observations in the lowest-level clusters, a two-stage estimation approach is used. In the first stage, the normal mixture parameters in each lowest-level cluster are estimated using robust methods. Robust alternatives to maximum likelihood estimation provide stable results even when the mixture components do not quite meet normality assumptions. The lowest-level cluster-specific means and standard deviations are then modeled in a mixed effects model in the second stage. A small simulation study compares the performance of finite normal mixture population parameter estimates based on robust and maximum likelihood estimation in stage 1. The proposed approach is illustrated through the analysis of mice tendon fibril diameter data. The results address genotype differences between corresponding components in the mixtures and demonstrate the advantages of robust estimation in stage 1.

19.
We will pursue a Bayesian nonparametric approach in the hierarchical mixture modelling of lifetime data in two situations: density estimation, when the distribution is a mixture of parametric densities with a nonparametric mixing measure, and accelerated failure time (AFT) regression modelling, when the same type of mixture is used for the distribution of the error term. The Dirichlet process is a popular choice for the mixing measure, yielding a Dirichlet process mixture model for the error; as an alternative, we also allow the mixing measure to be equal to a normalized inverse-Gaussian prior, built from normalized inverse-Gaussian finite dimensional distributions, as recently proposed in the literature. Markov chain Monte Carlo techniques will be used to estimate the predictive distribution of the survival time, along with the posterior distribution of the regression parameters. A comparison between the two models will be carried out on the grounds of their predictive power and their ability to identify the number of components in a given mixture density.
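A loose sklearn analogue of the Dirichlet process mixture idea (a truncated variational fit rather than the paper's MCMC, applied to invented log-lifetimes; the normalized inverse-Gaussian mixing measure has no off-the-shelf counterpart here):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# invented "lifetime" data from two lognormal groups
life = np.concatenate([rng.lognormal(0.0, 0.3, 150),
                       rng.lognormal(1.5, 0.2, 150)])
logx = np.log(life).reshape(-1, 1)

# truncated Dirichlet-process mixture of normals on the log scale
bgm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500, random_state=0).fit(logx)

# the DP prior prunes unused components, leaving few with non-trivial weight
print(np.sum(bgm.weights_ > 0.05))
```

The ability to leave most of the 10 allotted components empty is the "identify the number of components" behaviour the abstract compares across mixing measures.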

20.
Duplicate analysis is a strategy commonly used to assess the precision of bioanalytical methods. In some cases, duplicate analysis may rely on pooling data generated across organizations. Despite being generated under comparable conditions, organizations may produce duplicate measurements with different precision, so the pooled data consist of a heterogeneous collection of duplicate measurements. Precision estimates are often expressed as relative difference indexes (RDI), such as the relative percentage difference (RPD). Empirical evidence indicates that the frequency distribution of RDI values from heterogeneous data exhibits sharper peaks and heavier tails than a normal distribution. Therefore, traditional normal-based models may yield faulty or unreliable estimates of precision from heterogeneous duplicate data. In this paper, we survey mixture models that satisfactorily represent the distribution of RDI values from heterogeneous duplicate data. A simulation study compares the performance of the different models in providing reliable estimates and inferences for percentiles calculated from RDI values. These models are readily accessible to practitioners through modern statistical software. The utility of mixture models is explained in detail using a numerical example.
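The RPD computation and the heavy-tail phenomenon the paper describes can be sketched as follows (the two-precision setup and all numbers are invented):

```python
import numpy as np
from scipy.stats import kurtosis

def rpd(x1, x2):
    """Relative percentage difference between paired duplicate measurements."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return 100.0 * (x1 - x2) / ((x1 + x2) / 2.0)

rng = np.random.default_rng(0)
# heterogeneous duplicates: two organisations with different (assumed) precision
truth = rng.uniform(50.0, 150.0, size=1000)
sd = np.where(rng.random(1000) < 0.5, 1.0, 5.0)     # organisation-specific sd
d1 = truth + rng.normal(0.0, sd)
d2 = truth + rng.normal(0.0, sd)
values = rpd(d1, d2)

# pooling the two precisions yields a sharper peak and heavier tails
# than any single normal, i.e. positive excess kurtosis
print(kurtosis(values))
```

This scale-mixture effect is exactly why a normal mixture, rather than a single normal, is needed to model pooled RDI values.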


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号