Similar Articles
20 similar articles found (search time: 312 ms)
1.
Cluster analysis is one of the most widely used methods in statistical analysis, identifying homogeneous subgroups within a heterogeneous population. Because many applications involve mixed continuous and discrete data, standard clustering methods such as hierarchical methods, k-means and model-based methods have been extended to the analysis of mixed data. However, in the available model-based clustering methods the number of parameters grows rapidly with the number of continuous variables, so identifying and fitting an appropriate model may be difficult. In this paper, a set of parsimonious models is introduced to reduce the number of parameters in model-based clustering of mixed continuous (normal) and nominal data. The models use the general location model approach to describe the joint distribution of the mixed variables and impose a factor-analyzer structure on the covariance matrices. The ECM algorithm is used to estimate the parameters of these models. The clustering performance of the proposed models is demonstrated through simulation studies and the analysis of two real data sets.
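As a rough illustration of the parameter savings the abstract refers to (a sketch of the standard counting argument, not the authors' specific models), one can compare the number of free parameters in an unrestricted covariance matrix with that of a factor-analyzer covariance:

```python
def full_cov_params(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_cov_params(p, q):
    """Free parameters in a factor-analyzer covariance
    Sigma = Lambda Lambda' + Psi: p*q loadings plus p diagonal
    uniquenesses, minus q(q-1)/2 rotational constraints on Lambda."""
    return p * q + p - q * (q - 1) // 2

# With p = 20 continuous variables and q = 3 factors per component:
print(full_cov_params(20), factor_cov_params(20, 3))  # 210 vs 77
```

The gap widens as p grows, which is why factor-analyzer structures keep mixture models tractable in higher dimensions.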

2.
The paper reviews a number of data models for aggregate statistical data that have appeared in the computer science literature over the last ten years. After a brief introduction to data models in general, the fundamental concepts of statistical data are introduced. These are called statistical objects because they are complex data structures (vectors, matrices, relations, time series, etc.) that may have several possible representations (e.g. tables, relations, vectors, pie charts, bar charts, graphs, and so on). For this reason a statistical object is defined by two different types of attribute: a summary attribute, with its own summary type and its own instances (called summary data), and a set of category attributes, which describe the summary attribute. Some conceptual models of statistical data (CSM, SDM4S), some semantic models (SCM, SAM*, OSAM*), and some graphical models (SUBJECT, GRASS, STORM) are also discussed.

3.
In recent years, there has been considerable interest in regression models based on zero-inflated distributions. These models are commonly encountered in many disciplines, such as medicine, public health, and environmental sciences, among others. The zero-inflated Poisson (ZIP) model has typically been considered for these types of problems. However, the ZIP model can fail if the non-zero counts are overdispersed relative to the Poisson distribution, in which case the zero-inflated negative binomial (ZINB) model may be more appropriate. In this paper, we present a Bayesian approach for fitting the ZINB regression model. This model considers that an observed zero may come from a point mass distribution at zero or from the negative binomial model. The likelihood function is used not only to compute some Bayesian model selection measures, but also to develop Bayesian case-deletion influence diagnostics based on q-divergence measures. The approach can be easily implemented using standard Bayesian software, such as WinBUGS. The performance of the proposed method is evaluated with a simulation study. Further, a real data set is analyzed, where we show that ZINB regression models seem to fit the data better than their Poisson counterparts.
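The zero-inflation mechanism described here (a zero from a point mass with probability pi, otherwise a negative binomial draw) can be sketched directly; this is the generic ZINB pmf, not the authors' Bayesian implementation:

```python
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(y, pi, r, p):
    """Zero-inflated negative binomial pmf:
    P(Y = 0) = pi + (1 - pi) * NB(0; r, p),
    P(Y = y) = (1 - pi) * NB(y; r, p) for y > 0."""
    y = np.asarray(y)
    base = nbinom.pmf(y, r, p)
    return np.where(y == 0, pi + (1 - pi) * base, (1 - pi) * base)

# Probabilities over a long truncation of the support sum to ~1:
probs = zinb_pmf(np.arange(1000), pi=0.3, r=2, p=0.4)
print(probs.sum())  # ≈ 1.0
```

The extra mass at zero (here 0.3 on top of the negative binomial's own zeros) is what lets the model accommodate the excess zeros a plain Poisson or NB regression cannot.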

4.
Summary. Several techniques for exploring an n×p data set are considered in the light of the statistical framework: data = structure + noise. The first application is to Principal Component Analysis (PCA), in fact generalized PCA with an arbitrary metric M on the unit space ℝ^p. A natural model supporting this analysis is the fixed-effect model, where the expectation of each unit is assumed to belong to some q-dimensional linear manifold defining the structure, while the variance describes the noise. The best estimation of the structure is obtained for a proper choice of metric M and dimensionality q; guidelines for both choices are provided in Section 2. The second application is to Projection Pursuit, which aims to reveal structure in the original data by means of suitable low-dimensional projections. We suggest the use of generalized PCA with a suitable metric M as a Projection Pursuit technique; according to the kind of structure sought, two such metrics are proposed in Section 3. Finally, the analysis of n×p contingency tables is considered in Section 4. Since the data are frequencies, we assume a multinomial or Poisson model for the noise. Several models may be considered for the structural part: Correspondence Analysis rests on one of them, spherical factor analysis on another, and Goodman association models provide a further alternative. These different approaches are discussed and compared from several points of view.

5.
In the present article, we discuss the regression of a point on the surface of a unit sphere in d dimensions given a point on the surface of a unit sphere in p dimensions, where p need not equal d. A point projection is added to the rotation and linear transformation to form the regression link function. The identifiability of the model is proved, and parameter estimation in this setup is then discussed. Simulation studies and data analyses illustrate the model.

6.
Traditional phylogenetic inference assumes that the history of a set of taxa can be explained by a tree. This assumption is often violated as some biological entities can exchange genetic material giving rise to non‐treelike events often called reticulations. Failure to consider these events might result in incorrectly inferred phylogenies. Phylogenetic networks provide a flexible tool which allows researchers to model the evolutionary history of a set of organisms in the presence of reticulation events. In recent years, a number of methods addressing phylogenetic network parameter estimation have been introduced. Some of them are based on the idea that a phylogenetic network can be defined as a directed acyclic graph. Based on this definition, we propose a Bayesian approach to the estimation of phylogenetic network parameters which allows for different phylogenies to be inferred at different parts of a multiple DNA alignment. The algorithm is tested on simulated data and applied to the ribosomal protein gene rps11 data from five flowering plants, where reticulation events are suspected to be present. The proposed approach can be applied to a wide variety of problems which aim at exploring the possibility of reticulation events in the history of a set of taxa.

7.
Very often in psychometric research, as in educational assessment, it is necessary to analyze item responses from clustered respondents. The multiple-group item response theory (IRT) model proposed by Bock and Zimowski [12] provides a useful framework for analyzing this type of data. In this model, the selected groups of respondents are of specific interest, so group-specific population distributions need to be defined. The usual assumption for parameter estimation, namely that the latent traits are random variables following different symmetric normal distributions, has been questioned in many works in the IRT literature, and misleading inference can result when it does not hold. In this paper, we assume that the latent traits for each group follow different skew-normal distributions under the centered parameterization, and we call the result the skew multiple-group IRT model. This modeling extends the works of Azevedo et al. [4], Bazán et al. [11] and Bock and Zimowski [12] with respect to the latent trait distribution, and our approach ensures that the model is identifiable. We propose and compare, with regard to convergence, two Markov chain Monte Carlo (MCMC) algorithms for parameter estimation. A simulation study was performed to evaluate parameter recovery for the proposed model and the selected algorithm; the results reveal that the proposed algorithm properly recovers all model parameters. Furthermore, we analyzed a real data set that presents asymmetry in the latent trait distributions; the results obtained with our approach confirm the presence of negative asymmetry for some latent trait distributions.
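The skew-normal family the abstract relies on is available in scipy under the direct parameterization; the centered parameterization used in the paper re-expresses the same distribution in terms of its mean, standard deviation and skewness coefficient. A minimal look at the moments (illustrative values only):

```python
from scipy.stats import skewnorm

# Direct parameterization with shape a; the centered parameterization
# re-expresses (location, scale, shape) via the mean, standard deviation
# and skewness coefficient of the distribution.
a = 5.0  # positive shape -> right-skewed latent trait distribution
mean, var, skew = skewnorm.stats(a, moments='mvs')
print(float(mean), float(var), float(skew))
```

With a = 0 the family collapses to the symmetric normal, which is exactly the special case whose routine use the abstract questions.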

8.
Representative points (RPs) are a set of points that optimally represents a distribution in terms of mean squared error. When the prior data are location-biased, direct methods such as the k-means algorithm may be inefficient for obtaining the RPs. In this article, a new indirect algorithm is proposed to search for RPs based on location-biased data sets. The algorithm does not constrain the parametric form of the true distribution. An empirical study shows that it can obtain better RPs than the k-means algorithm.
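For context, the direct baseline the article compares against can be sketched as plain Lloyd's k-means on a one-dimensional sample (this is the reference method, not the authors' new indirect algorithm):

```python
import numpy as np

def kmeans_rps(x, k, iters=100):
    """Representative points of a 1-D sample via Lloyd's k-means:
    k points minimizing the mean squared distance to the sample."""
    rps = np.quantile(x, (np.arange(k) + 0.5) / k)  # deterministic start
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - rps[None, :]), axis=1)
        rps = np.array([x[labels == j].mean() if np.any(labels == j) else rps[j]
                        for j in range(k)])
    return np.sort(rps)

rng = np.random.default_rng(1)
sample = rng.normal(size=10_000)
rps = kmeans_rps(sample, 3)
print(rps)  # close to the Lloyd-Max points (-1.224, 0, 1.224) for N(0, 1)
```

When the sample is location-biased, the cluster means no longer estimate the RPs of the true distribution, which is the inefficiency the proposed indirect algorithm is designed to avoid.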

9.
In some statistical applications, the data cannot be considered a random sample of the whole population, because some subjects have a lower probability of belonging to the sample. Consequently, statistical inference for such data sets usually yields biased estimates. In such situations, the length-biased version of the original random variable, a special weighted distribution, often produces better inferences. Here an alternative weighted distribution based on the mean residual life is suggested to treat the biasedness. Since the Rayleigh distribution is applied in many real applications, the proposed weighting is used to produce a new lifetime distribution based on the Rayleigh model, and the statistical properties of the proposed distribution are investigated. A simulation study and a real data set illustrate that the mean-residual weighted Rayleigh distribution gives a better fit than both the original and the length-biased Rayleigh distributions.
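The weighted-distribution construction underlying both competitors is f_w(x) = w(x) f(x) / E[w(X)]. A quick numerical check of the length-biased Rayleigh case (w(x) = x; this is the baseline the paper compares against, not its mean-residual-life weight):

```python
import numpy as np
from scipy.integrate import quad

sigma = 2.0

def rayleigh_pdf(x):
    # Rayleigh density f(x) = (x / sigma^2) exp(-x^2 / (2 sigma^2)), x > 0
    return (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

def length_biased_pdf(x):
    # weighted density w(x) f(x) / E[w(X)] with w(x) = x;
    # for the Rayleigh, E[X] = sigma * sqrt(pi / 2)
    return x * rayleigh_pdf(x) / (sigma * np.sqrt(np.pi / 2))

total = quad(length_biased_pdf, 0, np.inf)[0]
print(total)  # ≈ 1.0: the weighted function is again a proper density
```

Replacing w(x) = x with the mean residual life function yields the paper's alternative weighting by the same normalization recipe.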

10.
ABSTRACT

In a test of significance, it is common practice to report the p-value as one way of summarizing the incompatibility between a set of data and a proposed model for the data, constructed under a set of assumptions together with a null hypothesis. However, the p-value has some flaws: one concerns its definition for two-sided tests in general, and a related, more serious logical one is its incoherence when interpreted as a statistical measure of evidence for its null hypothesis. We address these two issues in this article.

11.
ABSTRACT

Consider the problem of estimating the positions of a set of targets in a multidimensional Euclidean space from distances reported by a number of observers who do not know their own positions in the space. Each observer reports the distance from the observer to each target plus a random error. This statistical problem is the basic model for the various forms of what is called multidimensional unfolding in the psychometric literature. Multidimensional unfolding methodology, as developed in cognitive psychology, is essentially a statistical estimation problem in which the data are measures that are monotonic functions of Euclidean distances between a number of observers and targets in a multidimensional space. The new method presented in this article estimates the target locations and the observer positions when the observations are functions of the squared distances between observers and targets, observed with additive random error, in a two-dimensional space. The method provides robust estimates of the target locations for the parametric structure of the data-generating model presented in the article. It also yields estimates of the orientation of the coordinate system and of the mean and variances of the observer locations; the mean and variances are not estimated by standard unfolding methods, which yield target maps that are invariant to rotation of the coordinate system. The data are transformed so that the nonlinearity due to the squared observer locations is removed. The sampling properties of the estimates are derived from the asymptotic variances of the additive errors of a maximum-likelihood factor analysis of the sample covariance matrix of the transformed data, augmented with bootstrapping. The robustness of the new method is tested using artificial data, and the method is applied to a 2001 survey data set from Turkey as a real data example.

12.
Markov chain Monte Carlo techniques have revolutionized the field of Bayesian statistics. Their power is so great that they can even accommodate situations in which the structure of the statistical model itself is uncertain. However, the analysis of such trans-dimensional (TD) models is not easy and available software may lack the flexibility required for dealing with the complexities of real data, often because it does not allow the TD model to be simply part of some bigger model. In this paper we describe a class of widely applicable TD models that can be represented by a generic graphical model, which may be incorporated into arbitrary other graphical structures without significantly affecting the mechanism of inference. We also present a decomposition of the reversible jump algorithm into abstract and problem-specific components, which provides infrastructure for applying the method to all models in the class considered. These developments represent a first step towards a context-free method for implementing TD models that will facilitate their use by applied scientists for the practical exploration of model uncertainty. Our approach makes use of the popular WinBUGS framework as a sampling engine and we illustrate its use via two simple examples in which model uncertainty is a key feature.

13.
Nonlinear mixed-effects (NLME) models are flexible enough to handle repeated-measures data from various disciplines. In this article, we propose both maximum-likelihood and restricted maximum-likelihood estimation of NLME models using first-order conditional expansion (FOCE) and the expectation-maximization (EM) algorithm. The FOCE-EM algorithm, implemented in the ForStat procedure SNLME, is compared with the Lindstrom and Bates (LB) algorithm, implemented in both the SAS macro NLINMIX and the S-Plus/R function nlme, in terms of computational efficiency and statistical properties. Two real-world data sets, an orange tree data set and a Chinese fir (Cunninghamia lanceolata) data set, and a simulated data set were used for evaluation. FOCE-EM converged for all mixed models derived from the base model in the two real-world cases, while LB did not, especially for models in which random effects are simultaneously considered in several parameters to account for between-subject variation. However, both algorithms gave identical parameter estimates and fit statistics for the converged models. We therefore recommend FOCE-EM for NLME models, particularly when convergence is a concern in model selection.

14.
In many longitudinal studies, multiple characteristics of each individual are collected along with the time to occurrence of an event of interest. In such data sets, some of the correlated characteristics may be discrete and some continuous. In this paper, a joint model is proposed for analysing multivariate longitudinal data comprising mixed continuous and ordinal responses together with a time-to-event variable. We model the association between the longitudinal mixed data and the time-to-event data using a multivariate zero-mean Gaussian process. For the discrete ordinal responses, a continuous latent variable following the logistic distribution is assumed; for the continuous responses, a Gaussian mixed-effects model is used. For the event time, an accelerated failure time model is considered under different distributional assumptions. Parameters are estimated in a Bayesian framework using Markov chain Monte Carlo. The performance of the proposed methods is illustrated in simulation studies, and a real data set is analysed under different model structures, with model comparison performed using a variety of statistical criteria.

15.
The Buckley–James estimator (BJE) [J. Buckley and I. James, Linear regression with censored data, Biometrika 66 (1979), pp. 429–436] has been extended from right-censored (RC) data to interval-censored (IC) data by Rabinowitz et al. [D. Rabinowitz, A. Tsiatis, and J. Aragon, Regression with interval-censored data, Biometrika 82 (1995), pp. 501–513]. The BJE is defined to be a zero-crossing of a modified score function H(b), a point at which H(·) changes its sign. We discuss several approaches (for finding a BJE with IC data) which are extensions of the existing algorithms for RC data. However, these extensions may not be appropriate for some data, in particular, they are not appropriate for a cancer data set that we are analysing. In this note, we present a feasible iterative algorithm for obtaining a BJE. We apply the method to our data.
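The notion of a zero-crossing used here (a point where H(·) flips sign, even if it never equals zero) can be illustrated with a generic sign-bisection routine; this is a toy sketch of the concept, not the authors' iterative algorithm:

```python
def zero_crossing(H, lo, hi, tol=1e-8):
    """Locate a sign change of H on [lo, hi] by bisection on the sign.
    This works even when H is a step function that jumps across zero
    without attaining it, as censored-data score functions typically do."""
    flo = H(lo)
    assert flo * H(hi) < 0, "H must change sign on [lo, hi]"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if flo * H(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# A toy step "score" that flips sign at b = 1.7 without ever equalling zero:
b_hat = zero_crossing(lambda b: -1.0 if b < 1.7 else 1.0, 0.0, 3.0)
print(b_hat)  # ≈ 1.7
```

Because H is a step function of b in the censored-regression setting, derivative-based root finders do not apply, which is why zero-crossing searches of this kind are used instead.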

16.
When a two-level multilevel model (MLM) is used for repeated growth data, the individuals constitute level 2 and the successive measurements constitute level 1, nested within the individuals at level 2. The heterogeneity among individuals is represented by either a random-intercept or a random-coefficient (slope) model. The variance components at level 1 involve serial effects and measurement errors, under either constant variance or heteroscedasticity. This study hypothesizes that omitting serial effects and/or heteroscedasticity may bias the results obtained from two-level models. To illustrate this effect, we conducted two simulation studies in which the simulated data were based on the characteristics of an empirical mouse tumour data set. The results suggest that for repeated growth data with constant variance (measurement error) and misspecified serial effects (ρ > 0.3), the proportion of level-2 variation (intra-class correlation coefficient) increases with ρ, and the two-level random-coefficient model is the minimum-AIC (or AICc) model when compared with the fixed, heteroscedasticity and random-intercept models. When the serial effect (ρ > 0.1) and heteroscedasticity are both misspecified, the two-level random-coefficient model is again the minimum-AIC (or AICc) model when compared with the fixed and random-intercept models. This study demonstrates that omitted serial effects and/or heteroscedasticity may masquerade as heterogeneity among individuals in repeated growth data (mixed or two-level MLM). This issue is critical in biomedical research.

17.
Robust mixture modelling using the t distribution
Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.  
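The robustness of the t mixture comes from the E-step down-weighting of distant points. A minimal sketch of those weights for a single univariate t component (illustrative parameter values; the full ECM fit in the paper involves more steps):

```python
import numpy as np

def t_weights(x, mu, sigma2, nu):
    """E-step weights u_i = (nu + 1) / (nu + d_i) for a univariate t
    component, where d_i = (x_i - mu)^2 / sigma2 is the squared
    Mahalanobis distance. Points far from the centre receive small
    weights, which is the source of the t mixture's robustness."""
    d = (x - mu) ** 2 / sigma2
    return (nu + 1) / (nu + d)

x = np.array([0.1, -0.2, 0.3, 8.0])  # the last point is an outlier
w = t_weights(x, mu=0.0, sigma2=1.0, nu=4.0)
print(w)  # the outlier's weight is far below the others'
```

Under a normal component every point effectively gets weight 1, so a single gross outlier can drag the component mean; the t weights shrink its influence instead.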

18.
In dose-response models, there are cases where only a portion of the administered dose may have an effect, resulting in stochastic compliance with the administered dose. In a previous paper (Chen-Mok and Sen, 1999), we developed suitable adjustments for compliance in the logistic model under the assumption of nondifferential measurement error. These compliance-adjusted models were categorized into three types: (i) low (or near-zero) dose levels, (ii) moderate dose levels, and (iii) high dose levels. In this paper, we analyze a data set on the atomic bomb survivors of Japan to illustrate the use of the proposed methods. In addition, we examine the performance of these methods under different conditions in a simulation study. Among the three cases, the adjustments proposed for the moderate-dose case do not seem to work adequately: both bias and variance are larger when using the adjusted model than with the unadjusted model. The adjustments for the low-dose case reduce the bias in the parameter estimates under all types of compliance distributions, although the MSEs are larger under some of the compliance distributions considered. Finally, the simulation results show that the adjustments for the high-dose case achieve both a reduction in bias and a reduction in MSE, so the overall efficiency of the estimation is improved.

19.
In this paper, we extend the censored linear regression model with normal errors to Student-t errors. A simple EM-type algorithm for iteratively computing maximum-likelihood estimates of the parameters is presented. To examine the performance of the proposed model, case-deletion and local influence techniques are developed to assess its robustness against outlying and influential observations. This is done by analysing the sensitivity of the EM estimates under some usual perturbation schemes in the model or data and by inspecting some proposed diagnostic graphics. The efficacy of the method is verified through the analysis of simulated data sets and by modelling a real data set first analysed under normal errors. The proposed algorithm and methods are implemented in the R package CensRegMod.

20.
Event history models typically assume that the entire population is at risk of experiencing the event of interest throughout the observation period. However, there will often be individuals, referred to as long-term survivors, who may be considered a priori to have a zero hazard throughout the study period. In this paper, a discrete-time mixture model is proposed in which the probability of long-term survivorship and the timing of event occurrence are modelled jointly. Another feature of event history data that often needs to be considered is that they may come from a population with a hierarchical structure. For example, individuals may be nested within geographical regions and individuals in the same region may have similar risks of experiencing the event of interest due to unobserved regional characteristics. Thus, the discrete-time mixture model is extended to allow for clustering in the likelihood and timing of an event within regions. The model is further extended to allow for unobserved individual heterogeneity in the hazard of event occurrence. The proposed model is applied in an analysis of contraceptive sterilization in Bangladesh. The results show that a woman's religion and education level affect her probability of choosing sterilization, but not when she gets sterilized. There is also evidence of community-level variation in sterilization timing, but not in the probability of sterilization.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号