Similar literature
Found 20 similar documents (search time: 125 ms)
1.
Summary. Multilevel modelling is sometimes used for data from complex surveys involving multistage sampling, unequal sampling probabilities and stratification. We consider generalized linear mixed models and particularly the case of dichotomous responses. A pseudolikelihood approach for accommodating inverse probability weights in multilevel models with an arbitrary number of levels is implemented by using adaptive quadrature. A sandwich estimator is used to obtain standard errors that account for stratification and clustering. When level 1 weights are used that vary between elementary units in clusters, the scaling of the weights becomes important. We point out that not only variance components but also regression coefficients can be severely biased when the response is dichotomous. The pseudolikelihood methodology is applied to complex survey data on reading proficiency from the American sample of the 'Program for international student assessment' 2000 study, using the Stata program gllamm which can estimate a wide range of multilevel and latent variable models. Performance of pseudo-maximum-likelihood with different methods for handling level 1 weights is investigated in a Monte Carlo experiment. Pseudo-maximum-likelihood estimators of (conditional) regression coefficients perform well for large cluster sizes but are biased for small cluster sizes. In contrast, estimators of marginal effects perform well in both situations. We conclude that caution must be exercised in pseudo-maximum-likelihood estimation for small cluster sizes when level 1 weights are used.

2.
Among the diverse frameworks that have been proposed for regression analysis of angular data, the projected multivariate linear model provides a particularly appealing and tractable methodology. In this model, the observed directional responses are assumed to correspond to the angles formed by latent bivariate normal random vectors that are assumed to depend upon covariates through a linear model. This implies an angular normal distribution for the observed angles, and incorporates a regression structure through a familiar and convenient relationship. In this paper we extend this methodology to accommodate clustered data (e.g., longitudinal or repeated measures data) by formulating a marginal version of the model and basing estimation on an EM‐like algorithm in which correlation among within‐cluster responses is taken into account by incorporating a working correlation matrix into the M step. A sandwich estimator is used for the covariance matrix of the parameter estimates. The methodology is motivated and illustrated using an example involving clustered measurements of microfibril angle on loblolly pine (Pinus taeda L.). Simulation studies are presented that evaluate the finite sample properties of the proposed fitting method. In addition, the relationship between within‐cluster correlation on the latent Euclidean vectors and the corresponding correlation structure for the observed angles is explored.

3.
Shared frailty models are of interest when one has clustered survival data and when focus is on comparing the lifetimes within clusters and further on estimating the correlation between lifetimes from the same cluster. It is well known that the positive stable model should be preferred to the gamma model in situations where the correlated survival data show a decreasing association with time. In this paper, we devise a likelihood based estimation procedure for the positive stable shared frailty Cox model, which is expected to obtain high efficiency. The proposed estimator is provided with large sample properties and also a consistent estimator of standard errors is given. Simulation studies show that the estimation procedure is appropriate for practical use, and that it is much more efficient than a recently suggested procedure. The suggested methodology is applied to a dataset concerning time to blindness for patients with diabetic retinopathy.

4.
The data collection process and the inherent population structure are the main causes for clustered data. The observations in a given cluster are correlated, and the magnitude of such correlation is often measured by the intra-cluster correlation coefficient. The intra-cluster correlation can lead to an inflated size of the standard F test in a linear model. In this paper, we propose a solution to this problem. Unlike previous adjustments, our method does not require estimation of the intra-class correlation, which is problematic especially when the number of clusters is small. Our simulation results show that the new method outperforms the existing methods.

5.
Generalized estimating equations (GEE) have become a popular method for marginal regression modelling of data that occur in clusters. Features of the GEE methodology are the use of a ‘working covariance’, an approximation to the underlying covariance, which is used to improve the efficiency in estimating the regression coefficients, and the ‘sandwich’ estimate of variance, which provides a way of consistently estimating their standard errors. These techniques have been extended to include estimating equations for the underlying correlation structure, both to improve the efficiency of the regression coefficient estimates and to provide estimates of correlations between units in a cluster, when these are of interest. If the mean structure is of primary interest, then a simpler set of equations (GEE1) can be used, whereas if the underlying covariance structure is of interest in its own right, the use of the more complex GEE2 estimating equations is often recommended. In this paper, we compare the effect of increasing the complexity of the ‘working covariances’ on the variance of the parameter estimates, as well as the mean-squared error of the ‘sandwich’ estimate of variance. We give asymptotic expressions for these variances and mean-squared error terms. We use these to study the behaviour of different variants of GEE1 and GEE2 when we change the number of clusters, the cluster size, and the within-cluster correlation. We conclude that the extra complexity of the full GEE2 approach is not usually justified if the mean structure is of primary interest.

6.
A robust generalized score test for comparing groups of clustered binary data is proposed. This novel test is asymptotically valid for practically any underlying correlation configuration, including the situation when correlation coefficients vary within or between clusters. This structure generally undermines the validity of the typical large sample properties of the method of maximum likelihood. Simulations and real data analysis are used to demonstrate the merit of this parametric robust method. Results show that our test is superior to two recently proposed test statistics advocated by other researchers.

7.
The article describes a generalized estimating equations approach that was used to investigate the impact of technology on vessel performance in a trawl fishery during 1988–96, while accounting for spatial and temporal correlations in the catch-effort data. Robust estimation of parameters in the presence of several levels of clustering depended more on the choice of cluster definition than on the choice of correlation structure within the cluster. Models with smaller cluster sizes produced stable results, while models with larger cluster sizes, that may have had complex within-cluster correlation structures and that had within-cluster covariates, produced estimates sensitive to the correlation structure. The preferred model arising from this dataset assumed that catches from a vessel were correlated in the same years and the same areas, but independent in different years and areas. The model that assumed catches from a vessel were correlated in all years and areas, equivalent to a random effects term for vessel, produced spurious results. This was an unexpected finding that highlighted the need to adopt a systematic strategy for modelling. The article proposes a modelling strategy of selecting the best cluster definition first, and the working correlation structure (within clusters) second. The article discusses the selection and interpretation of the model in the light of background knowledge of the data and utility of the model, and the potential for this modelling approach to apply in similar statistical situations.

8.
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and do subsequent sparse estimation such as the Lasso for cluster-representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster-representatives with the Lasso as subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

9.
Summary. Generalized estimating equations for correlated repeated ordinal score data are developed assuming a proportional odds model and a working correlation structure based on a first-order autoregressive process. Repeated ordinal scores on the same experimental units, not necessarily with equally spaced time intervals, are assumed and a new algorithm for the joint estimation of the model regression parameters and the correlation coefficient is developed. Approximate standard errors for the estimated correlation coefficient are developed and a simulation study is used to compare the new methodology with existing methodology. The work was part of a project on post-harvest quality of pot-plants and the generalized estimating equation model is used to analyse data on poinsettia and begonia pot-plant quality deterioration over time. The relationship between the key attributes of plant quality and the quality and longevity of ornamental pot-plants during shelf and after-sales life is explored.

10.
Efficient estimation of the regression coefficients in longitudinal data analysis requires a correct specification of the covariance structure. If misspecification occurs, it may lead to inefficient or biased estimators of parameters in the mean. One of the most commonly used methods for handling the covariance matrix is based on simultaneous modeling of the Cholesky decomposition. Therefore, in this paper, we reparameterize covariance structures in longitudinal data analysis through a modified Cholesky decomposition of the covariance matrix itself. Based on this modified Cholesky decomposition, the within-subject covariance matrix is decomposed into a unit lower triangular matrix involving moving average coefficients and a diagonal matrix involving innovation variances, which are modeled as linear functions of covariates. Then, we propose fully Bayesian inference for joint mean and covariance models based on this decomposition. A computationally efficient Markov chain Monte Carlo method which combines the Gibbs sampler and Metropolis–Hastings algorithm is implemented to simultaneously obtain the Bayesian estimates of unknown parameters, as well as their standard deviation estimates. Finally, several simulation studies and a real example are presented to illustrate the proposed methodology.
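
Editorial illustration: the decomposition described in this abstract can be sketched numerically. The snippet below is a minimal sketch, assuming the factorization takes the form Sigma = L D L^T with L unit lower triangular and D diagonal; the function name `modified_cholesky` and the example matrix are hypothetical and do not come from the paper, and the paper's covariate modelling of the entries of L and D is not shown.

```python
def modified_cholesky(sigma):
    """Factor a positive definite matrix as sigma = L D L^T.

    L is unit lower triangular (its off-diagonal entries play the role of
    moving average coefficients) and d holds the diagonal of D (the
    innovation variances). Illustrative helper, not the paper's code.
    """
    n = len(sigma)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Innovation variance: what is left of sigma[j][j] after
        # removing the part explained by earlier innovations.
        d[j] = sigma[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        if d[j] <= 0.0:
            raise ValueError("sigma must be positive definite")
        for i in range(j + 1, n):
            acc = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (sigma[i][j] - acc) / d[j]
    return L, d


# Hypothetical 2 x 2 covariance matrix.
L, d = modified_cholesky([[4.0, 2.0], [2.0, 3.0]])
print(L, d)
```

Because L has ones on its diagonal and the entries of d only need to be positive, both sets of parameters are unconstrained after a log transform of d, which is what makes regression-type modelling of them convenient.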

11.
The Generalized Estimating Equation (GEE) method popularized by Liang and Zeger provides a very general method for fitting regression models to observations that occur in clusters. Features of the method are the specification of a 'working correlation' (a guess at the true correlation structure of the data) which is used to improve efficiency in estimating the regression coefficients, and the 'information sandwich' which provides a way of consistently estimating the standard errors of the estimated regression coefficients even if (as we might expect) the working correlation is wrong. This paper develops asymptotic expressions for the bias and efficiency both of the regression coefficient estimates and of the sandwich estimate, and uses them to study the behaviour of the estimates. It looks at the effect of the choice of the working correlation on the estimate and also examines the effect of different cluster sizes and different degrees of correlation between the covariates. The performance of these methods is found to be excellent, particularly when the degree of correlation in the responses and covariates is small to moderate.
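
Editorial illustration: the idea of a sandwich estimate that stays valid when the working correlation is wrong can be seen in the simplest possible case. The snippet below is a minimal sketch, assuming an intercept-only model with an independence working correlation, so that the estimate is just the sample mean; the function name `mean_with_sandwich` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mean_with_sandwich(y, cluster):
    """Intercept-only GEE with an independence working correlation.

    Returns (estimate, naive_variance, sandwich_variance).
    """
    n = len(y)
    mu = sum(y) / n  # solves the estimating equation sum_i (y_i - mu) = 0
    resid = [yi - mu for yi in y]

    # Naive variance: treats all n observations as independent.
    naive = sum(r * r for r in resid) / (n - 1) / n

    # Sandwich: bread = 1/n on each side, meat = sum over clusters of the
    # squared cluster-level score (the sum of residuals in the cluster).
    score = defaultdict(float)
    for r, g in zip(resid, cluster):
        score[g] += r
    meat = sum(s * s for s in score.values())
    sandwich = meat / (n * n)
    return mu, naive, sandwich


# Two clusters whose members are perfectly correlated (toy data).
print(mean_with_sandwich([1.0, 1.0, 3.0, 3.0], [0, 0, 1, 1]))
```

In this toy example the sandwich variance (0.5) exceeds the naive variance (about 0.33), reflecting that positively correlated observations within a cluster carry less information than the independence working model assumes.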

12.
Social network data represent the interactions between a group of social actors. Interactions between colleagues and friendship networks are typical examples of such data. The latent space model for social network data locates each actor in a network in a latent (social) space and models the probability of an interaction between two actors as a function of their locations. The latent position cluster model extends the latent space model to deal with network data in which clusters of actors exist: actor locations are drawn from a finite mixture model, each component of which represents a cluster of actors. A mixture of experts model builds on the structure of a mixture model by taking account of both observations and associated covariates when modeling a heterogeneous population. Herein, a mixture of experts extension of the latent position cluster model is developed. The mixture of experts framework allows covariates to enter the latent position cluster model in a number of ways, yielding different model interpretations. Estimates of the model parameters are derived in a Bayesian framework using a Markov Chain Monte Carlo algorithm. The algorithm is generally computationally expensive, so surrogate proposal distributions which shadow the target distributions are derived, reducing the computational burden. The methodology is demonstrated through an illustrative example detailing relationships between a group of lawyers in the USA.

13.
This paper develops a nonparametric model of the relationship between survival S and a dichotomous random variable X under the order constraint that P(X=1|S=s) is increasing (or decreasing) with s. The estimation procedure, called isotonic regression, has been studied in some depth for the case of uncensored data, but we give a methodology which is appropriate in the more general context of right, left, and interval censored data. An EM algorithm (Dempster et al., 1977) is used for maximum likelihood estimation.
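
Editorial illustration: for the uncensored case mentioned in this abstract, an isotonic fit can be computed with the pool-adjacent-violators algorithm. The snippet below is a minimal sketch, assuming the observations are already sorted by survival time s and that X is coded 0/1; the function name `pava` and the toy data are hypothetical, and the censored-data EM procedure that is the paper's actual contribution is not shown.

```python
def pava(y, w=None):
    """Pool-adjacent-violators: nondecreasing weighted least-squares fit.

    y must be ordered by the covariate (here, survival time s).
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    # Expand the pooled blocks back to one fitted value per observation.
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit


# Toy binary outcomes ordered by increasing s.
print(pava([0, 1, 0, 1, 1]))
```

The fitted values are pooled proportions within blocks, so each one can be read as an estimate of P(X=1|S=s) that is constant over a range of s and never decreases.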

14.
We consider the adjustment, based upon a sample of size n, of collections of vectors drawn from either an infinite or finite population. The vectors may be judged to be either normally distributed or, more generally, second-order exchangeable. We develop the work of Goldstein and Wooff (1998) to show how the familiar univariate finite population corrections (FPCs) naturally generalise to individual quantities in the multivariate population. The types of information we gain by sampling are identified with the orthogonal canonical variable directions derived from a generalised eigenvalue problem. These canonical directions share the same co-ordinate representation for all sample sizes and, for equally defined individuals, all population sizes enabling simple comparisons between both the effects of different sample sizes and of different population sizes. We conclude by considering how the FPC is modified for multivariate cluster sampling with exchangeable clusters. In univariate two-stage cluster sampling, we may decompose the variance of the population mean into the sum of the variance of cluster means and the variance of the cluster members within clusters. The first term has a FPC relating to the sampling fraction of clusters, the second term has a FPC relating to the sampling fraction of cluster size. We illustrate how this generalises in the multivariate case. We decompose the variance into two terms: the first relating to multivariate finite population sampling of clusters and the second to multivariate finite population sampling within clusters. We solve two generalised eigenvalue problems to show how to generalise the univariate to the multivariate: each of the two FPCs attaches to one, and only one, of the two eigenbases.
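
Editorial illustration: the univariate two-stage decomposition mentioned in this abstract has a familiar classical counterpart. The formula below is a sketch under stated assumptions, namely the textbook design-based setting with N equal-sized clusters of M members each, from which n clusters and then m members per sampled cluster are drawn by simple random sampling. The symbols N, M, m, S_b^2 (variance among cluster means) and S_w^2 (variance among members within clusters) are notation assumed here, not taken from the paper, whose exchangeability-based formulation may differ in detail.

```latex
\operatorname{Var}(\bar{y})
  = \underbrace{\left(1 - \frac{n}{N}\right)}_{\text{FPC for clusters}} \frac{S_b^2}{n}
  + \underbrace{\left(1 - \frac{m}{M}\right)}_{\text{FPC within clusters}} \frac{S_w^2}{n\,m}
```

The first correction vanishes when every cluster is sampled (n = N) and the second when every member of each sampled cluster is observed (m = M), which matches the abstract's statement that each FPC attaches to exactly one of the two variance terms.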

15.
We use simulations based on data on injury severity in car accidents to compare methods for the analysis of very large data sets containing clusters of individuals for which the measured response is polytomous. Retrospective sampling of clusters is used to expedite the analysis of the large data set while at the same time obtaining information about rare, but important, outcomes. An additional complication in the analysis of such data sets is that there can be two types of covariates: those which vary within a cluster and those which vary only among clusters. Weighted generalized estimating equations are developed to obtain consistent estimates of the regression coefficients in a proportional-odds model, along with a weighted robust covariance matrix to estimate the variabilities of these estimated coefficients.

16.
This paper presents a new Bayesian, infinite mixture model based, clustering approach, specifically designed for time-course microarray data. The problem is to group together genes which have “similar” expression profiles, given the set of noisy measurements of their expression levels over a specific time interval. In order to capture temporal variations of each curve, a non-parametric regression approach is used. Each expression profile is expanded over a set of basis functions and the sets of coefficients of each curve are subsequently modeled through a Bayesian infinite mixture of Gaussian distributions. The task of finding clusters of genes with similar expression profiles is thereby reduced to the problem of grouping together genes whose coefficients are sampled from the same distribution in the mixture. A Dirichlet process prior is naturally employed in such models, since it allows one to deal automatically with the uncertainty about the number of clusters. The posterior inference is carried out by a split and merge MCMC sampling scheme which integrates out parameters of the component distributions and updates only the latent vector of the cluster membership. The final configuration is obtained via the maximum a posteriori estimator. The performance of the method is studied using synthetic and real microarray data and is compared with the performances of competitive techniques.

17.
Abstract. In this paper, conditional on random family effects, we consider an auto‐regression model for repeated count data and their corresponding time‐dependent covariates, collected from the members of a large number of independent families. The count responses, in such a set up, unconditionally exhibit a non‐stationary familial–longitudinal correlation structure. We then take this two‐way correlation structure into account, and develop a generalized quasilikelihood (GQL) approach for the estimation of the regression effects and the familial correlation index parameter, whereas the longitudinal correlation parameter is estimated by using the well‐known method of moments. The performance of the proposed estimation approach is examined through a simulation study. Some model mis‐specification effects are also studied. The estimation methodology is illustrated by analysing real life healthcare utilization count data collected from 36 families of size four over a period of 4 years.

18.
The data collection process in most observational and experimental studies yields different types of variables, leading to the use of joint models that are capable of handling multiple data types. Evaluation of various statistical techniques that have been developed for mixed data in simulated environments requires concurrent generation of multiple variables. In this article, I present an important augmentation to a unified framework proposed in our previously published work for simultaneously generating binary and nonnormal continuous data given the marginal characteristics and correlation structure, via fifth-order power polynomials that are known to extend the area covered in the skewness-elongation plane and to provide a better approximation to the probability density function of the continuous variables. I evaluate how well the improved methodology performs in comparison to the original one, in a simulated setting with illustrations of algorithmic steps. Although the relative gains for the associational quantities are not substantial, the augmented version appears to better capture the marginal quantities that are pertinent to the higher-order moments, as indicated by very close resemblance between the specified and empirically computed quantities on average.

19.

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”


20.
A robust estimator for a wide family of mixtures of linear regression is presented. Robustness is based on the joint adoption of the cluster weighted model and of an estimator based on trimming and restrictions. The selected model provides the conditional distribution of the response for each group, as in mixtures of regression, and further supplies local distributions for the explanatory variables. A novel version of the restrictions has been devised, under this model, for separately controlling the two sources of variability identified in it. This proposal avoids singularities in the log-likelihood, caused by approximate local collinearity in the explanatory variables or local exact fits in regressions, and reduces the occurrence of spurious local maximizers. In a natural way, due to the interaction between the model and the estimator, the procedure is able to resist the harmful influence of bad leverage points in the estimation of the mixture of regressions, which is still an open issue in the literature. The given methodology defines a well-posed statistical problem, whose estimator exists and is consistent for the corresponding solution of the population optimum, under widely general conditions. A feasible EM algorithm has also been provided to obtain the corresponding estimation. Many simulated examples and two real datasets have been chosen to show the ability of the procedure, on the one hand, to detect anomalous data, and, on the other hand, to identify the real cluster regressions without the influence of contamination.
