Similar Documents
20 similar documents found.
1.
In real-life situations, we often encounter data sets containing missing observations. Statistical methods that address missingness have been extensively studied in recent years. One of the more popular approaches involves imputation of the missing values prior to the analysis, thereby rendering the data complete. Imputation broadly encompasses an entire scope of techniques that have been developed to make inferences about incomplete data, ranging from very simple strategies (e.g. mean imputation) to more advanced approaches that require estimation, for instance, of posterior distributions using Markov chain Monte Carlo methods. Additional complexity arises when the number of missingness patterns increases and/or when both categorical and continuous random variables are involved. Implementations of routines, procedures, or packages capable of generating imputations for incomplete data are now widely available. We review some of these in the context of a motivating example, as well as in a simulation study, under two missingness mechanisms (missing at random and missing not at random). Thus far, evaluation of existing implementations has frequently centred on the resulting parameter estimates of the prescribed model of interest after imputing the missing data. In some situations, however, interest may very well be in the quality of the imputed values at the level of the individual – an issue that has received relatively little attention. In this paper, we focus on the latter to provide further insight into the performance of the different routines, procedures, and packages in this respect.
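As a minimal sketch of the kind of comparison described (synthetic data and scikit-learn's SimpleImputer and IterativeImputer standing in for the reviewed routines; these are assumptions of this example, not the packages actually evaluated), one can contrast mean imputation with model-based iterative imputation and judge quality at the level of the individual imputed values:

```python
# A minimal sketch, not the specific routines reviewed in the paper: simple mean
# imputation versus iterative model-based imputation on synthetic data, judged
# at the level of the individual imputed values.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)        # x2 is correlated with x1
X_full = np.column_stack([x1, x2])

# Impose missingness on x2 that depends only on the observed x1 (MAR).
miss = rng.random(n) < 1 / (1 + np.exp(-x1))
X_obs = X_full.copy()
X_obs[miss, 1] = np.nan

mean_imp = SimpleImputer(strategy="mean").fit_transform(X_obs)
iter_imp = IterativeImputer(random_state=0).fit_transform(X_obs)

def rmse(imputed):
    """Root mean squared error of the imputed x2 values against the truth."""
    return np.sqrt(np.mean((imputed[miss, 1] - X_full[miss, 1]) ** 2))

print("mean imputation RMSE:     ", rmse(mean_imp))
print("iterative imputation RMSE:", rmse(iter_imp))
```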

2.
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Though there are alternatives like complete case analysis and imputation, existing methods for the computation of such measures cannot be applied straightforwardly when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data—whether it does or does not contain missing values. An extensive simulation study shows that the new measure meets sensible requirements and shows good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. It takes the occurrence of missing values into account, which makes the results also differ from those obtained under multiple imputation.
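For reference, the standard permutation importance that such a measure generalizes can be sketched on complete data as follows (a hedged illustration with synthetic data and scikit-learn, not the paper's new missing-value-aware measure):

```python
# A minimal sketch of standard permutation importance on complete data -- the
# baseline that the paper's new measure generalizes to data with missing values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)
result = permutation_importance(forest, X, y, n_repeats=20, random_state=1)

# Rank variables by the mean drop in accuracy when each column is permuted.
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"x{j}: importance = {result.importances_mean[j]:.3f}")
```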

3.
In the past decades, the number of variables explaining observations in different practical applications has increased gradually. This has led to heavy computational tasks, despite the widespread use of provisional variable selection methods in data processing. Therefore, more methodological techniques have appeared to reduce the number of explanatory variables without losing much of the information. In these techniques, two distinct approaches are apparent: ‘shrinkage regression’ and ‘sufficient dimension reduction’. Surprisingly, there has not been any communication or comparison between these two methodological categories, and it is not clear when each of the two approaches is appropriate. In this paper, we fill some of this gap by first reviewing each category in brief, paying special attention to the most commonly used methods in each category. We then compare commonly used methods from both categories based on their accuracy, computation time, and their ability to select effective variables. A simulation study on the performance of the methods in each category is carried out as well. The selected methods are concurrently tested on two sets of real data, which allows us to recommend conditions under which one approach is more appropriate to be applied to high-dimensional data.
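A toy illustration of the two categories (the lasso as a shrinkage method and principal-component regression as a simple stand-in for dimension reduction; sufficient dimension reduction methods such as SIR are not shown) might look as follows, assuming synthetic data and scikit-learn:

```python
# A toy comparison, not the paper's full study: lasso as a shrinkage method
# versus principal-component regression as a simple dimension-reduction method.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]                  # only 5 effective variables
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5)
pcr = make_pipeline(PCA(n_components=5), LinearRegression())

print("lasso CV R^2:", cross_val_score(lasso, X, y, cv=5).mean())
print("PCR   CV R^2:", cross_val_score(pcr, X, y, cv=5).mean())
print("lasso-selected variables:", np.flatnonzero(lasso.fit(X, y).coef_))
```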

4.
Outlier tests are developed for multivariate data where there is a structure to the covariance or correlation matrix. Particular structures considered are the block diagonal structure, where there are reasons to assume that one set of variables is independent of another, and the equicorrelation structure, where it may be assumed that all pairs of variables have the same correlation. Likelihood ratio tests for an outlier are derived for these situations and critical values, under the null hypothesis of no outliers present, are determined for selected sample sizes and dimensions, using Bonferroni bounds or simulation. The powers of the tests are compared with those of Wilks' statistic for a variety of situations. It is shown that the test procedures which incorporate knowledge of the correlation structure have considerably greater power than the usual tests, particularly in relatively small samples with several dimensions.
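For contrast, the classical screening that ignores any covariance structure can be sketched as Mahalanobis distances compared with a Bonferroni-adjusted chi-square cutoff (a hedged illustration, not the structured likelihood ratio tests derived in the paper):

```python
# A minimal sketch of classical multivariate outlier screening with an
# unstructured covariance estimate (Mahalanobis distance versus a chi-square
# cutoff); it does not exploit block-diagonal or equicorrelation structure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
X[0] = [4, 4, -4, 4]                               # planted outlier

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)
d2 = np.einsum("ij,jk,ik->i", X - mu, np.linalg.inv(S), X - mu)

cutoff = stats.chi2.ppf(1 - 0.05 / n, df=p)        # Bonferroni-adjusted cutoff
print("flagged observations:", np.flatnonzero(d2 > cutoff))
```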

5.
Classical omnibus and more recent methods are adapted to panel data situations in order to jointly test for normality of the error components. The test statistics incorporate either the empirical distribution function or the empirical characteristic function, these functions resulting from estimation of the fixed and random components. Monte Carlo results show that the new procedure based on the empirical characteristic function compares favorably with classical methods.
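A minimal sketch of the empirical characteristic function idea (a single sample of standardized residuals compared with the standard normal characteristic function exp(-t²/2); the paper's panel-data statistics are more elaborate):

```python
# A minimal sketch: empirical characteristic function (ECF) of standardized
# residuals compared with the standard normal characteristic function.
import numpy as np

rng = np.random.default_rng(4)
resid = rng.normal(size=400)                       # stand-in for estimated errors
z = (resid - resid.mean()) / resid.std()

t = np.linspace(-3, 3, 61)
ecf = np.array([np.mean(np.exp(1j * s * z)) for s in t])
normal_cf = np.exp(-t**2 / 2)

# A simple (unweighted) distance between the ECF and the normal CF.
print("max |ECF - normal CF|:", np.max(np.abs(ecf - normal_cf)))
```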

6.
Missing data often complicate the analysis of scientific data. Multiple imputation is a general-purpose technique for the analysis of datasets with missing values. The approach is applicable to a variety of missing data patterns but is often complicated by restrictions such as the type of variables to be imputed and the mechanism underlying the missing data. In this paper, the authors compare the performance of two multiple imputation methods, namely fully conditional specification and multivariate normal imputation, in the presence of ordinal outcomes with monotone missing data patterns. Through a simulation study and an empirical example, the authors show that the two methods are indeed comparable, meaning that either may be used when faced with scenarios at least similar to the ones presented here.

7.
The application of conventional statistical methods to directional data generally produces erroneous results. Various regression models for a circular response have been presented in the literature; however, these are unsatisfactory either in the limited relationships that can be modeled or in the limitations on the number or type of admissible covariates. One difficulty with circular regression is devising a meaningful regression function. This problem is exacerbated when trying to incorporate both linear and circular variables as covariates. Due to these complexities, circular regression is ripe for exploration via tree-based methods, in which a formal regression function is not needed, but where insight into the general structure and relationship between predictors and the response may be obtained. A basic framework for regression trees, predicting a circular response from a combination of circular and linear predictors, is presented.
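A plausible building block for such trees (a sketch under the assumption that a node predicts the circular mean of its angles and that node homogeneity is measured by the mean resultant length; the paper's exact splitting rule may differ):

```python
# A plausible building block for a circular regression tree (a sketch, not the
# paper's exact framework): a node predicts the circular mean of its angles,
# and node homogeneity is measured by the mean resultant length.
import numpy as np

def circular_mean(theta):
    """Circular mean of angles (radians)."""
    return np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())

def resultant_length(theta):
    """Mean resultant length; values near 1 mean tightly concentrated angles."""
    return np.hypot(np.sin(theta).mean(), np.cos(theta).mean())

rng = np.random.default_rng(5)
theta = rng.vonmises(mu=np.pi / 4, kappa=5, size=100)   # concentrated near 45 degrees
print("circular mean (deg):", np.degrees(circular_mean(theta)))
print("mean resultant length:", resultant_length(theta))
```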

8.
Although “choose all that apply” questions are common in modern surveys, methods for analyzing associations among responses to such questions have only recently been developed. These methods are generally valid only for simple random sampling, but these types of questions often appear in surveys conducted under more complex sampling plans. The purpose of this article is to provide statistical analysis methods that can be applied to “choose all that apply” questions in complex survey sampling situations. Loglinear models are developed to incorporate the multiple responses inherent in these types of questions. Statistics to compare models and to measure association are proposed and their asymptotic distributions are derived. Monte Carlo simulations show that tests based on adjusted Pearson statistics generally hold their correct size when comparing models. These simulations also show that confidence intervals for odds ratios estimated from loglinear models have good coverage properties, while being shorter than those constructed using empirical estimates. Furthermore, the methods are shown to be applicable to more general problems of modeling associations between elements of two or more binary vectors. The proposed analysis methods are applied to data from the National Health and Nutrition Examination Survey.
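As a minimal illustration of the loglinear idea for two "choose all that apply" items under simple random sampling (a hedged sketch with hypothetical counts; it omits the survey-design adjustments that are the article's contribution):

```python
# A hedged sketch with hypothetical counts, assuming simple random sampling.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Joint counts for two items A and B from a "choose all that apply" question
# (1 = item selected, 0 = not selected).
table = pd.DataFrame({
    "A":     [0, 0, 1, 1],
    "B":     [0, 1, 0, 1],
    "count": [120, 60, 45, 75],
})

# Saturated Poisson loglinear model; the A:B term measures item association.
fit = smf.glm("count ~ A * B", data=table, family=sm.families.Poisson()).fit()
print(fit.params)
print("estimated odds ratio between items:", np.exp(fit.params["A:B"]))
```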

9.
When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.

10.
Pair-copula constructions (or vine copulas) are structured, in the layout of vines, with bivariate copulas and conditional bivariate copulas. The main contribution of the current work is an approach to a long-standing problem: how to cope with the dependence structure between the two conditioned variables indicated by an edge, acknowledging that this dependence structure changes with the values of the conditioning variables. The changeable dependence problem, though recognized as crucial in the field of multivariate modelling, remains widely unexplored due to its inherent complexity and hence motivates the current work. Rather than resorting to traditional parametric or nonparametric methods, we proceed from an innovative viewpoint: approximating a conditional copula, to any required degree of approximation, by utilizing a family of basis functions. We fully incorporate the impact of the conditioning variables on the functional form of a conditional copula by employing local learning methods. The attractions and dilemmas of the pair-copula approximating technique are revealed via simulated data, and its practical importance is evidenced via a real data set.
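For three variables, the textbook pair-copula (D-vine) decomposition underlying such constructions factors the joint density as

```latex
f(x_1,x_2,x_3) = f_1(x_1)\,f_2(x_2)\,f_3(x_3)\,
  c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)\,
  c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)\,
  c_{13|2}\bigl(F_{1|2}(x_1 \mid x_2),\, F_{3|2}(x_3 \mid x_2);\, x_2\bigr),
```

and the conditional copula c_{13|2}, whose form may change with x_2, is precisely the object approximated here by a family of basis functions.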

11.
Various methods for clustering mixed-mode data are compared. It is found that a method based on a finite mixture model, in which the observed categorical variables are generated from underlying continuous variables, outperforms more conventional methods when applied to artificially generated data. This method also performs best when applied to Fisher's iris data in which two of the variables are categorized by applying thresholds.

12.
We incorporate a random clustering effect into the nonparametric version of the Cox proportional hazards model to characterize clustered survival data. The simulation studies provide evidence that clustered survival data can be better characterized through a nonparametric model. Predictive accuracy of the nonparametric model is affected by the number of clusters and the distribution of the random component accounting for the clustering effect. As the functional form of the covariate departs from linearity, the nonparametric model becomes more advantageous over its parametric counterpart. Finally, the nonparametric model is better than the parametric model when the data are highly heterogeneous and/or there is misspecification error.
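A standard way to write such a model is the shared frailty form (a sketch consistent with, but not necessarily identical to, the authors' formulation):

```latex
\lambda_{ij}(t) = \lambda_0(t)\, w_i \exp\{ g(x_{ij}) \},
```

where w_i is a positive random effect shared by all subjects in cluster i (e.g. gamma or log-normal), and g(·) is linear in the parametric model but left as an unspecified smooth function in the nonparametric version.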

13.
In the paper, some problems of minimax estimation and prediction are solved for the case in which statisticians estimate the same parameters (or predict the values of random variables whose distribution depends on the same parameters), know the dimensions of their colleagues' samples, but do not know the values of these samples.

14.
In many regression problems, predictors are naturally grouped. For example, when a set of dummy variables is used to represent categorical variables, or a set of basis functions of continuous variables is included in the predictor set, it is important to carry out feature selection both at the group level and at the level of individual variables within groups simultaneously. To incorporate the group and within-group variable information into regularized model fitting, several regularization methods have been developed, including for the Cox regression and the conditional mean regression. Complementary to earlier works, the simultaneous selection of groups and of variables within groups is examined here in quantile regression. We propose a hierarchically penalized quantile regression, and show that the hierarchical penalty possesses the oracle property in quantile regression, as well as in the Cox regression. The proposed method is evaluated through simulation studies and a real data application.
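One common way to write such a hierarchical (group plus within-group) penalty in quantile regression — a sketch of the general construction, not necessarily the paper's exact formulation — decomposes each coefficient as β_{gj} = γ_g θ_{gj} with γ_g ≥ 0 and minimizes

```latex
\min_{\gamma \ge 0,\,\theta}\;
\sum_{i=1}^{n} \rho_\tau\!\Bigl( y_i - \sum_{g=1}^{G} \sum_{j=1}^{p_g} x_{i,gj}\, \gamma_g \theta_{gj} \Bigr)
\;+\; \lambda_1 \sum_{g=1}^{G} \gamma_g
\;+\; \lambda_2 \sum_{g=1}^{G} \sum_{j=1}^{p_g} \lvert \theta_{gj} \rvert,
\qquad \rho_\tau(u) = u\,\{\tau - I(u < 0)\},
```

so that the penalty on γ controls selection of whole groups while the penalty on θ controls selection of variables within groups.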

15.
Finite mixtures of distributions have been getting increasing use in the applied literature. In the continuous case, linear combinations of exponentials and gammas have been shown to be well suited for modeling purposes. In the discrete case, the focus has primarily been on continuous mixing, usually of Poisson distributions and typically using gammas to describe the random parameter. But many of these applications are forced, especially when a continuous mixing distribution is used. Instead, it is often preferable to try finite mixtures of geometric or negative binomial distributions, since these are the fundamental building blocks of all discrete random variables. To date, a major stumbling block to their use has been the lack of easy routines for estimating the parameters of such models. This problem has now been alleviated by the adaptation to the discrete case of numerical procedures recently developed for exponential, Weibull, and gamma mixtures. The new methods have been applied to four previously studied data sets, and significant improvements in goodness of fit are reported, with resultant implications for each affected study.
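A minimal EM sketch for a two-component geometric mixture (synthetic data; illustrating the kind of estimation routine referred to rather than the paper's exact numerical procedure):

```python
# A minimal EM sketch for a two-component geometric mixture (support 0, 1, 2, ...).
import numpy as np

def geom_pmf(k, p):
    """P(X = k) for the geometric distribution counting failures before success."""
    return (1 - p) ** k * p

def em_geometric_mixture(x, n_iter=200):
    w, p1, p2 = 0.5, 0.6, 0.1                      # crude starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each observation came from component 1.
        r = w * geom_pmf(x, p1) / (w * geom_pmf(x, p1) + (1 - w) * geom_pmf(x, p2))
        # M-step: updated mixing proportion and weighted geometric MLEs.
        w = r.mean()
        p1 = r.sum() / (r * (x + 1)).sum()
        p2 = (1 - r).sum() / ((1 - r) * (x + 1)).sum()
    return w, p1, p2

rng = np.random.default_rng(6)
x = np.where(rng.random(1000) < 0.4,
             rng.geometric(0.5, 1000) - 1,         # numpy's geometric starts at 1
             rng.geometric(0.05, 1000) - 1)
print(em_geometric_mixture(x))
```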

16.
Analysts of survey data are often interested in modelling the population process, or superpopulation, that gave rise to a 'target' set of survey variables. An important tool for this is maximum likelihood estimation. A survey is said to provide limited information for such inference if data used in the design of the survey are unavailable to the analyst. In this circumstance, sample inclusion probabilities, which are typically available, provide information which needs to be incorporated into the analysis. We consider the case where these inclusion probabilities can be modelled in terms of a linear combination of the design and target variables, and only sample values of these are available. Strict maximum likelihood estimation of the underlying superpopulation means of these variables appears to be analytically impossible in this case, but an analysis based on approximations to the inclusion probabilities leads to a simple estimator which is a close approximation to the maximum likelihood estimator. In a simulation study, this estimator outperformed several other estimators that are based on approaches suggested in the sampling literature.

17.
In many experiments, several measurements on the same variable are taken over time, a geographic region, or some other index set. It is often of interest to know if there has been a change over the index set in the parameters of the distribution of the variable. Frequently, the data consist of a sequence of correlated random variables, and there may also be several experimental units under observation, each providing a sequence of data. A problem in ascertaining the boundaries between the layers in geological sedimentary beds is used to introduce the model and then to illustrate the proposed methodology. It is assumed that, conditional on the change point, the data from each sequence arise from an autoregressive process that undergoes a change in one or more of its parameters. Unconditionally, the model then becomes a mixture of nonstationary autoregressive processes. Maximum-likelihood methods are used, and results of simulations to evaluate the performance of these estimators under practical conditions are given.
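A minimal sketch of the single-sequence, single-change case (a least-squares profile search for a change in the AR(1) coefficient; the paper's mixture-of-autoregressions formulation with several sequences is more general):

```python
# A minimal sketch: profile search for a single change in the AR(1) coefficient.
import numpy as np

rng = np.random.default_rng(7)
n, true_cp = 300, 180
e = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    phi = 0.2 if t < true_cp else 0.8              # AR coefficient changes at true_cp
    y[t] = phi * y[t - 1] + e[t]

def seg_rss(y):
    """Residual sum of squares after fitting an AR(1) slope by least squares."""
    x, z = y[:-1], y[1:]
    phi_hat = (x @ z) / (x @ x)
    return ((z - phi_hat * x) ** 2).sum()

candidates = range(20, n - 20)                      # keep both segments reasonably long
rss = [seg_rss(y[:k]) + seg_rss(y[k:]) for k in candidates]
print("estimated change point:", list(candidates)[int(np.argmin(rss))])
```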

18.
Binary data are commonly used as responses to assess the effects of independent variables in longitudinal factorial studies. Such effects can be assessed in terms of the rate difference (RD), the odds ratio (OR), or the rate ratio (RR). Traditionally, logistic regression seems to be the recommended method, with statistical comparisons made in terms of the OR. Statistical inference in terms of the RD and RR can then be derived using the delta method. However, this approach is hard to realize when repeated measures occur. To obtain statistical inference in longitudinal factorial studies, the current article shows that the mixed-effects model for repeated measures, the logistic regression for repeated measures, the log-transformed regression for repeated measures, and the rank-based methods are all valid methods that lead to inference in terms of the RD, OR, and RR, respectively. Asymptotic linear relationships between the estimators of the regression coefficients of these models are derived when the weight (working covariance) matrix is an identity matrix. Conditions for the Wald-type tests to be asymptotically equivalent in these models are provided, and powers are compared using simulation studies. A phase III clinical trial is used to illustrate the investigated methods, with corresponding SAS® code supplied.
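For two groups with response probabilities p_1 and p_0, the three effect measures referred to are

```latex
\mathrm{RD} = p_1 - p_0, \qquad
\mathrm{RR} = \frac{p_1}{p_0}, \qquad
\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}.
```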

19.
Clinical studies in overactive bladder have traditionally used analysis of covariance or nonparametric methods to analyse the number of incontinence episodes and other count data. It is known that if the underlying distributional assumptions of a particular parametric method do not hold, an alternative parametric method may be more efficient than a nonparametric one, which makes no assumptions regarding the underlying distribution of the data. Therefore, there are advantages in using methods based on the Poisson distribution or extensions of that method, which incorporate specific features that provide a modelling framework for count data. One challenge with count data is overdispersion, but methods are available that can account for this through the introduction of random effect terms in the modelling, and it is this modelling framework that leads to the negative binomial distribution. These models can also provide clinicians with a clearer and more appropriate interpretation of treatment effects in terms of rate ratios. In this paper, the previously used parametric and non-parametric approaches are contrasted with those based on Poisson regression and various extensions in trials evaluating solifenacin and mirabegron in patients with overactive bladder. In these applications, negative binomial models are seen to fit the data well.
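A minimal sketch of such a model with synthetic data (not the solifenacin or mirabegron trial data) fits a negative binomial regression with a log link, so that the exponentiated treatment coefficient is read directly as a rate ratio:

```python
# A minimal sketch with synthetic data and a hypothetical treatment indicator:
# negative binomial regression with a log link, giving a rate-ratio interpretation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 400
treat = rng.integers(0, 2, n)                        # hypothetical treatment indicator
mu = np.exp(1.0 - 0.4 * treat)                       # true rate ratio exp(-0.4) ~ 0.67
counts = rng.negative_binomial(n=2, p=2 / (2 + mu))  # overdispersed counts with mean mu

X = sm.add_constant(treat)
fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()
print("estimated rate ratio:", np.exp(fit.params[1]))
```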

20.
Regression methods for common data types such as measured, count and categorical variables are well understood, but increasingly statisticians need ways to model relationships between variable types such as shapes, curves, trees, correlation matrices and images that do not fit into the standard framework. Data types that lie in metric spaces but not in vector spaces are difficult to use within the usual regression setting, either as the response or as a predictor. We represent the information in these variables using distance matrices, which requires only the specification of a distance function. A low-dimensional representation of such distance matrices can be obtained using methods such as multidimensional scaling. Once these variables have been represented as scores, an internal model linking the predictors and the responses can be developed using standard methods. We call the transformation from a new observation to a score 'scoring', whereas 'backscoring' is a method to represent a score as an observation in the data space. Both methods are essential for prediction and explanation. We illustrate the methodology for shape data, unregistered curve data and correlation matrices using motion capture data from an experiment to study the motion of children with cleft lip.
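A minimal sketch of the scoring step (hypothetical correlation-matrix objects compared by Frobenius distance, multidimensional scaling to two-dimensional scores, and an ordinary regression of the scores on a predictor; backscoring is not shown):

```python
# A minimal sketch of the scoring step: distance matrix -> MDS scores -> regression.
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
n_obj = 40
age = rng.uniform(5, 12, n_obj)                      # hypothetical predictor

# Hypothetical 3x3 correlation matrices whose off-diagonal strength drifts with age.
def make_corr(a):
    r = 0.05 * a + rng.normal(scale=0.05)
    C = np.full((3, 3), r)
    np.fill_diagonal(C, 1.0)
    return C

objects = [make_corr(a) for a in age]
D = np.array([[np.linalg.norm(A - B, "fro") for B in objects] for A in objects])

scores = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
reg = LinearRegression().fit(age.reshape(-1, 1), scores)
print("R^2 of scores regressed on age:", reg.score(age.reshape(-1, 1), scores))
```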
