Similar Articles
20 similar articles found.
1.
Efficient statistical inference on nonignorable missing data is a challenging problem. This paper proposes a new estimation procedure based on composite quantile regression (CQR) for linear regression models with nonignorable missing data that is applicable even with high-dimensional covariates. A parametric model is assumed for the response probability, which is estimated by the empirical likelihood approach. Local identifiability of the proposed strategy is guaranteed on the basis of an instrumental variable approach. A set of data-based adaptive weights constructed via an empirical likelihood method is used to weight the CQR functions. The proposed method is resistant to heavy-tailed errors or outliers in the response. An adaptive penalisation method for variable selection is proposed to achieve sparsity with high-dimensional covariates. Limiting distributions of the proposed estimators are derived. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies, and the method is applied to the ACTG 175 data.
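As a point of reference for the CQR building block, here is a minimal sketch of plain (unweighted) composite quantile regression on fully observed synthetic data; the paper's empirical-likelihood weights for nonignorable missingness and the adaptive penalty are omitted, and all data and parameter values are illustrative.

```python
# Composite quantile regression: one shared slope, quantile-specific intercepts.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, beta_true = 200, np.array([1.5, -2.0])
X = rng.normal(size=(n, 2))
y = X @ beta_true + rng.standard_t(df=2, size=n)   # heavy-tailed errors

taus = np.arange(1, 10) / 10                       # 9 composite quantile levels

def cqr_loss(params):
    # params = (intercepts b_1..b_K, shared slope beta)
    b, beta = params[:len(taus)], params[len(taus):]
    loss = 0.0
    for tau, bk in zip(taus, b):
        r = y - bk - X @ beta
        loss += np.sum(r * (tau - (r < 0)))        # check (pinball) loss
    return loss

x0 = np.zeros(len(taus) + 2)
fit = minimize(cqr_loss, x0, method="Powell")      # derivative-free: loss is non-smooth
print("estimated slope:", fit.x[len(taus):])       # should be close to beta_true
```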

2.
When confronted with multiple covariates and a response variable, analysts sometimes apply a variable-selection algorithm to the covariate-response data to identify a subset of covariates potentially associated with the response, and then wish to make inferences about parameters in a model for the marginal association between the selected covariates and the response. If an independent data set were available, the parameters of interest could be estimated by using standard inference methods to fit the postulated marginal model to the independent data set. However, when applied to the same data set used by the variable selector, standard ("naive") methods can lead to distorted inferences. The authors develop testing and interval estimation methods for parameters reflecting the marginal association between the selected covariates and response variable, based on the same data set used for variable selection. They provide theoretical justification for the proposed methods, present results to guide their implementation, and use simulations to assess and compare their performance to a sample-splitting approach. The methods are illustrated with data from a recent AIDS study. The Canadian Journal of Statistics 37: 625-644; 2009 © 2009 Statistical Society of Canada
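A hedged sketch of the sample-splitting benchmark the authors compare against (not their proposed procedure): select variables on one half of synthetic data, then run naive OLS inference on the untouched half, where the usual intervals are valid because the selection event is independent of that half.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

X_sel, X_inf, y_sel, y_inf = train_test_split(X, y, test_size=0.5, random_state=1)

# Step 1: variable selection on the first half only.
lasso = LassoCV(cv=5).fit(X_sel, y_sel)
selected = np.flatnonzero(lasso.coef_ != 0)

# Step 2: standard OLS inference for the marginal model on the second half.
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, selected])).fit()
print("selected columns:", selected)
print(ols.summary().tables[1])
```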

3.
Latent variable models are widely used for the joint modeling of mixed data, including nominal, ordinal, count and continuous data. In this paper, we consider a latent variable model for jointly modeling the relationships between mixed binary, count and continuous variables with some observed covariates. We assume that, given a latent variable, the mixed variables of interest are independent, with the count and continuous variables following Poisson and normal distributions, respectively. As such data may be extracted from different subpopulations, unobserved heterogeneity has to be taken into account; a mixture distribution is therefore assumed for the latent variable. A generalized EM algorithm, which uses the Newton-Raphson algorithm inside the EM algorithm, is used to compute the maximum likelihood estimates of the parameters, and their standard errors are computed using the supplemented EM algorithm. An analysis of the primary biliary cirrhosis data is presented as an application of the proposed model.

4.
Adjustment for covariates is a time-honored tool in statistical analysis and is often implemented by including the covariates that one intends to adjust for as additional predictors in a model. This adjustment often does not work well when the underlying model is misspecified. We consider here the situation where we compare a response between two groups. This response may depend on a covariate whose distribution differs between the two groups one intends to compare. This creates the potential that observed differences are due to differences in covariate levels rather than "genuine" population differences that cannot be explained by covariate differences. We propose a bootstrap-based adjustment method. Bootstrap weights are constructed with the aim of aligning the bootstrap-weighted empirical distributions of the covariate between the two groups. More generally, the proposed weighted-bootstrap algorithm can be used to align or match the values of an explanatory variable as closely as desired to those of a given target distribution. We illustrate the proposed bootstrap adjustment method in simulations and in the analysis of data on the fecundity of historical cohorts of French-Canadian women.
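A minimal sketch of the alignment idea on synthetic data: weight one group by a kernel density ratio so that its weighted covariate distribution mimics the target group, then resample with those weights. The authors' actual weight construction may differ; this only illustrates the mechanism.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
x_group = rng.normal(loc=0.0, scale=1.0, size=500)    # covariate in group A
x_target = rng.normal(loc=0.5, scale=1.0, size=500)   # covariate in group B

dens_group = gaussian_kde(x_group)
dens_target = gaussian_kde(x_target)

# Weight each group-A observation by the density ratio, so the weighted
# empirical distribution of x_group approximates that of x_target.
w = dens_target(x_group) / dens_group(x_group)
w /= w.sum()

# Weighted bootstrap: resample group A using the alignment weights.
boot_idx = rng.choice(len(x_group), size=len(x_group), replace=True, p=w)
x_boot = x_group[boot_idx]
print("target mean %.3f, weighted-bootstrap mean %.3f"
      % (x_target.mean(), x_boot.mean()))
```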

5.
Semiparametric regression models with multiple covariates are commonly encountered. When some covariates are not associated with the response variable, variable selection can lead to sparser models, more lucid interpretation and more accurate estimation. In this study, we adopt a sieve approach for the estimation of the nonparametric covariate effects in semiparametric regression models, and a two-step iterated penalization approach for variable selection. In the first step, a mixture of the Lasso and group Lasso penalties is employed to conduct the first-round variable selection and obtain an initial estimate. In the second step, a mixture of the weighted Lasso and weighted group Lasso penalties, with weights constructed from the initial estimate, is employed for variable selection. We show that the proposed iterated approach has the variable selection consistency property even when the number of unknown parameters diverges with the sample size. Numerical studies, including simulation and the analysis of a diabetes dataset, show satisfactory performance of the proposed approach.
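A hedged sketch of the two-step reweighting idea using plain Lasso only (scikit-learn has no group Lasso, and the sieve basis expansion is omitted): step 1 gives an initial estimate, step 2 reweights the penalty by the inverse initial coefficients, implemented here by rescaling columns (the adaptive-Lasso trick).

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 300, 15
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Step 1: initial Lasso fit.
init = LassoCV(cv=5).fit(X, y)
w = 1.0 / (np.abs(init.coef_) + 1e-8)        # penalty weights from step 1

# Step 2: weighted Lasso via column rescaling; a coefficient theta_j on the
# rescaled column X_j / w_j corresponds to beta_j = theta_j / w_j on X_j.
X_resc = X / w
final = LassoCV(cv=5).fit(X_resc, y)
beta = final.coef_ / w
print("nonzero coefficients:",
      {j: round(b, 2) for j, b in enumerate(beta) if b != 0})
```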

6.
There are several procedures for fitting generalized additive models, i.e. regression models for an exponential family response where the influence of each single covariate is assumed to have an unknown, potentially non-linear shape. Simulated data are used to compare a smoothing parameter optimization approach for the selection of smoothness and of covariates, a stepwise approach, a mixed model approach, and a procedure based on boosting techniques. In particular, it is investigated how the performance of the procedures is linked to the amount of information, the type of response, the total number of covariates, the number of influential covariates, and the extent of non-linearity. Measures for comparison are prediction performance, identification of influential covariates, and smoothness of the fitted functions. One result is that the mixed model approach returns sparse fits with frequently over-smoothed functions, while for the boosting approach the functions are less smooth and variable selection is less strict. The other approaches lie in between with respect to these measures. The boosting procedure performs very well when little information is available and/or when a large number of covariates is to be investigated. It is somewhat surprising that in scenarios with low information the fitting of a linear model, even with stepwise variable selection, offers little advantage over the fitting of an additive model when the true underlying structure is linear. In cases with more information the prediction performance of all procedures is very similar. So, in difficult data situations the boosting approach can be recommended; in others the procedure can be chosen according to the aim of the analysis.
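To make the boosting idea concrete, here is a compact sketch of componentwise gradient boosting for an additive model: each iteration fits a one-covariate base learner (a shallow tree here, as a stand-in for the penalized splines typically used) to the current residuals and keeps only the best-fitting component, which performs implicit variable selection. Data and tuning values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
n, p = 300, 8
X = rng.uniform(-2, 2, size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)

nu, n_iter = 0.1, 200                  # step length and number of boosting steps
fit = np.full(n, y.mean())
learners = []                          # (covariate index, fitted tree) pairs

for _ in range(n_iter):
    resid = y - fit
    best = None
    for j in range(p):                 # try a base learner per single covariate
        tree = DecisionTreeRegressor(max_depth=2).fit(X[:, [j]], resid)
        rss = np.sum((resid - tree.predict(X[:, [j]])) ** 2)
        if best is None or rss < best[0]:
            best = (rss, j, tree)
    _, j, tree = best
    fit += nu * tree.predict(X[:, [j]])
    learners.append((j, tree))

used = sorted({j for j, _ in learners})
print("covariates selected by boosting:", used)   # typically {0, 1}
```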

7.
In some clinical, environmental, or economic studies, researchers are interested in a semi-continuous outcome variable that takes the value zero with a discrete probability and has a continuous distribution for the non-zero values. Due to the measuring mechanism, it is not always possible to fully observe some outcomes, and only an upper bound is recorded. We call such data left-censored: we observe only the maximum of the outcome and an independent censoring variable, together with an indicator. In this article, we introduce a mixture semi-parametric regression model. We use a parametric model to investigate the influence of covariates on the discrete probability of the value zero, and, for the non-zero part of the outcome, a semi-parametric Cox regression model to study the conditional hazard function. The parameters in this mixture model are estimated by a likelihood method, where the infinite-dimensional baseline hazard function is estimated by a step function. We show the identifiability and consistency of the estimators of the model parameters, study the finite sample behaviour of the estimators through a simulation study, and illustrate the model on a practical data example.
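A structural sketch of the two-part idea on simulated data: a logistic model for P(Y = 0) and a Cox model for the non-zero part. Note that lifelines' CoxPHFitter handles right censoring, so it is used here only as a stand-in for the paper's left-censored likelihood; all column names and parameter values are illustrative.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
is_zero = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.8 * x))))   # zero part
t = rng.exponential(scale=np.exp(-0.5 * x))                      # non-zero part
c = rng.exponential(scale=2.0, size=n)                           # censoring
obs = np.minimum(t, c)
event = (t <= c).astype(int)

# Part 1: covariate effect on the probability of a zero outcome.
logit = LogisticRegression().fit(x.reshape(-1, 1), is_zero)
print("zero-part coefficient:", logit.coef_.ravel())

# Part 2: Cox regression on the subjects with a non-zero outcome.
nz = is_zero == 0
df = pd.DataFrame({"time": obs[nz], "event": event[nz], "x": x[nz]})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()
```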

8.
Variable selection in finite mixture of regression (FMR) models is frequently used in statistical modeling. Most applications of variable selection in FMR models assume a normal distribution for the regression error, which is unsuitable for data containing groups of observations with heavy tails or outliers. In this paper, we introduce a robust variable selection procedure for FMR models based on the t distribution. With an appropriate choice of the tuning parameters, the consistency and the oracle property of the regularized estimators are established. To estimate the parameters of the model, we develop an EM algorithm for numerical computation and a method for selecting the tuning parameters adaptively. The parameter estimation performance of the proposed model is evaluated through simulation studies, and its application is illustrated with a real data set.
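For orientation, here is a minimal EM sketch for a two-component finite mixture of linear regressions with normal errors on simulated data; the paper replaces the normal with a t density for robustness and adds a penalty for variable selection, both omitted here for brevity.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 400
x = rng.uniform(-2, 2, size=n)
z = rng.binomial(1, 0.5, size=n)                        # latent component label
y = np.where(z == 1, 2.0 + 1.5 * x, -1.0 - 0.5 * x) + rng.normal(scale=0.4, size=n)
X = np.column_stack([np.ones(n), x])

pi = 0.5
beta = np.array([[1.0, 1.0], [-1.0, -1.0]])             # crude starting values
sigma = np.array([1.0, 1.0])
for _ in range(100):
    # E-step: responsibility of component 0 for each observation.
    d0 = pi * norm.pdf(y, X @ beta[0], sigma[0])
    d1 = (1 - pi) * norm.pdf(y, X @ beta[1], sigma[1])
    r = d0 / (d0 + d1)
    # M-step: weighted least squares and weighted error variance per component.
    for k, w in enumerate([r, 1 - r]):
        XtW = X.T * w
        beta[k] = np.linalg.solve(XtW @ X, XtW @ y)
        res = y - X @ beta[k]
        sigma[k] = np.sqrt(np.sum(w * res ** 2) / np.sum(w))
    pi = r.mean()
print("mixing weight:", round(float(pi), 2))
print("component coefficients:\n", beta.round(2))
```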

9.
The paper focuses on a Bayesian treatment of measurement error problems and on the question of the specification of the prior distribution of the unknown covariates. It presents a flexible semiparametric model for this distribution based on a mixture of normal distributions with an unknown number of components. Implementation of this prior model as part of a full Bayesian analysis of measurement error problems is described in classical set-ups that are encountered in epidemiological studies: logistic regression between unknown covariates and outcome, with a normal or log-normal error model and a validation group. The feasibility of this combined model is tested and its performance is demonstrated in a simulation study that includes an assessment of the influence of misspecification of the prior distribution of the unknown covariates and a comparison with the semiparametric maximum likelihood method of Roeder, Carroll and Lindsay. Finally, the methodology is illustrated on a data set on coronary heart disease and cholesterol levels in blood.

10.
Latent Variable Models for Mixed Discrete and Continuous Outcomes
We propose a latent variable model for mixed discrete and continuous outcomes. The model accommodates any mixture of outcomes from an exponential family and allows for arbitrary covariate effects, as well as direct modelling of covariate effects on the latent variable. An EM algorithm is proposed for parameter estimation, and estimates of the latent variables are produced as a by-product of the analysis. A generalized likelihood ratio test can be used to test the significance of covariates affecting the latent outcomes. The method is applied to birth defects data, where the outcomes of interest are continuous measures of size and binary indicators of minor physical anomalies; infants who were exposed in utero to anticonvulsant medications are compared with controls.

11.
Shi, Wang, Murray-Smith and Titterington (Biometrics 63:714-723, 2007) proposed a Gaussian process functional regression (GPFR) model for functional response curves with a set of functional covariates. Their method addresses two main problems: modelling a nonlinear, nonparametric regression relationship, and modelling the covariance and mean structures simultaneously. It gives very good results for curve fitting and prediction but side-steps the problem of heterogeneity. In this paper we present a new method for modelling functional data that are 'spatially' indexed, i.e., where the heterogeneity depends on factors such as region and individual patient information. For data collected from different sources, we assume that the data corresponding to each curve (or batch) follow a Gaussian process functional regression model as a lower-level model, and introduce an allocation model for the latent indicator variables as a higher-level model that depends on the information related to each batch. This method takes advantage of both GPFR and mixture models and therefore improves the accuracy of predictions. The mixture model is also used for curve clustering, with the focus on clustering functional relationships between the response curve and the covariates, i.e. the clustering is based on the shape of the functional response as a function of the functional covariates. The model is examined on simulated and real data.
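A rough sketch of the allocation idea, with simulated batches: fit a Gaussian process per batch, then assign a new curve to the batch whose GP gives it the highest predictive log density. This stands in for the paper's full GPFR mixture with covariate-dependent allocation probabilities; kernel choices and data are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 30).reshape(-1, 1)
batches = {0: np.sin(2 * np.pi * t.ravel()), 1: np.cos(2 * np.pi * t.ravel())}

gps = {}
for k, mean_curve in batches.items():
    y = mean_curve + rng.normal(scale=0.1, size=len(t))
    gps[k] = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(0.01)).fit(t, y)

# A new curve generated from batch 1: score it under each batch's GP.
y_new = batches[1] + rng.normal(scale=0.1, size=len(t))
for k, gp in gps.items():
    mu, sd = gp.predict(t, return_std=True)
    loglik = norm.logpdf(y_new, mu, sd).sum()
    print(f"batch {k}: predictive log density {loglik:.1f}")
```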

12.
13.
A key question for understanding the cross-section of expected returns of equities is the following: which factors, from a given collection of factors, are risk factors; equivalently, which factors are in the stochastic discount factor (SDF)? Though the SDF is unobserved, assumptions about which factors (from the available set) are in the SDF restrict the joint distribution of factors in specific ways, as a consequence of the economic theory of asset pricing. A different starting collection of factors that go into the SDF leads to a different set of restrictions on the joint distribution of factors. The conditional distribution of equity returns has the same restricted form regardless of what is assumed about the factors in the SDF, as long as the factors are traded; hence the distribution of asset returns is irrelevant for isolating the risk factors. The restricted factor models are distinct (non-nested) and do not arise by omitting or including a variable from a full model, which precludes analysis by standard statistical variable selection methods, such as those based on the lasso and its variants. Instead, we develop what we call a Bayesian model scan strategy, in which each factor is allowed to enter or not enter the SDF and the resulting restricted models (of which there are 114,674 in our empirical study) are simultaneously confronted with the data. We use a Student-t distribution for the factors, model-specific independent Student-t distributions for the location parameters, a training sample to fix prior locations, and a creative way to arrive at the joint distribution of several other model-specific parameters from a single prior distribution. This makes our method an essentially scalable, tuned black-box method that can be applied across our large model space with little to no user intervention. The model marginal likelihoods, and the implied posterior model probabilities, are compared with the prior probability of each model (1/114,674) to find the best-supported model, and thus the factors most likely to be in the SDF. We provide detailed simulation evidence of the high finite-sample accuracy of the method. Our empirical study with 13 leading factors reveals that the highest marginal likelihood model is a Student-t distributed factor model with 5 degrees of freedom and 8 risk factors.
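A toy version of the scan logic: enumerate every subset of candidate factors and compare approximate (BIC-based) marginal likelihoods of a Gaussian model in which in-SDF factors have free means (risk premia) and the rest have mean zero. The paper's actual scan uses Student-t factor models, genuine marginal likelihoods, and 114,674 asset-pricing restrictions; this only illustrates the enumeration, and all data are simulated.

```python
import itertools
import numpy as np

rng = np.random.default_rng(8)
n, k = 600, 4
F = rng.normal(size=(n, k))
F[:, 0] += 0.3                        # factors 0 and 2 carry nonzero premia
F[:, 2] += 0.2

def bic_score(subset):
    # Factors in `subset` get free means; the rest are restricted to zero.
    # Gaussian log likelihood with known unit variance, plus a BIC penalty.
    mu = np.zeros(k)
    mu[list(subset)] = F[:, list(subset)].mean(axis=0)
    loglik = -0.5 * np.sum((F - mu) ** 2)
    return loglik - 0.5 * len(subset) * np.log(n)

models = [s for r in range(k + 1) for s in itertools.combinations(range(k), r)]
best = max(models, key=bic_score)
print("number of models scanned:", len(models))
print("best-supported set of risk factors:", best)
```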

14.
In this paper we design a sure independence ranking and screening procedure for censored regression (cSIRS, for short) with ultrahigh-dimensional covariates. The inverse-probability-weighted cSIRS procedure is model-free in the sense that it does not specify a parametric or semiparametric regression function between the response variable and the covariates, and is thus robust to model mis-specification. This model-free property is very appealing in ultrahigh-dimensional data analysis, particularly when information about the underlying regression structure is lacking. The cSIRS procedure is also robust to outliers and extreme values, as it merely uses the rank of the censored response variable. We establish both the sure screening and the ranking consistency properties for the cSIRS procedure when the number of covariates p satisfies p = o{exp(an)}, where a is a positive constant and n is the available sample size. The advantages of cSIRS over existing competitors are demonstrated through comprehensive simulations and an application to the diffuse large-B-cell lymphoma data set.
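A simplified, hedged sketch of the screening idea on simulated data: weight uncensored observations by the inverse Kaplan-Meier estimate of the censoring survival function, then rank covariates by a weighted rank-correlation utility with the observed response. The exact cSIRS statistic in the paper differs in form.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from scipy.stats import rankdata

rng = np.random.default_rng(9)
n, p = 300, 1000
X = rng.normal(size=(n, p))
t = np.exp(X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n))   # true model: X0, X1
c = rng.exponential(scale=8.0, size=n)
y, delta = np.minimum(t, c), (t <= c).astype(int)

# Kaplan-Meier estimate of the censoring survival G(t) = P(C > t).
km = KaplanMeierFitter().fit(y, event_observed=1 - delta)
G = np.clip(km.survival_function_at_times(y).to_numpy(), 0.05, None)

w = delta / G                                  # inverse-probability weights
ry = rankdata(y) / n                           # ranks of the observed response
utility = np.array([abs(np.cov(w * ry, rankdata(X[:, j]) / n)[0, 1])
                    for j in range(p)])
top = np.argsort(utility)[::-1][:10]
print("top-ranked covariates:", top)           # ideally contains 0 and 1
```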

15.
In this paper, we propose the use of Bayesian quantile regression for the analysis of proportion data, and consider the case where the data present zero-or-one inflation using a two-part model approach. For the latter scheme, we assume that the response variable is generated by a mixed discrete-continuous distribution with a point mass at zero or one. Quantile regression is then used to explain the conditional distribution of the continuous part between zero and one, while the mixture probability is also modelled as a function of the covariates. We check the performance of these models with two simulation studies and illustrate the method with data on the proportion of households with access to electricity in Brazil.
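A frequentist stand-in for the two-part structure, on simulated data: a logistic model for the probability of the boundary value, and quantile regression (statsmodels QuantReg) for the strictly interior proportions. The paper works in a Bayesian framework; this sketch only mirrors the decomposition.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
n = 800
x = rng.normal(size=n)
inflated = rng.binomial(1, 1 / (1 + np.exp(-(-1.5 + x))))       # mass at 1
y = np.where(inflated == 1, 1.0,
             1 / (1 + np.exp(-(0.2 + 0.7 * x + rng.normal(size=n)))))

# Part 1: does x move the probability of the boundary value?
part1 = LogisticRegression().fit(x.reshape(-1, 1), inflated)
print("inflation coefficient:", part1.coef_.ravel())

# Part 2: median regression for the interior proportions (0 < y < 1),
# on the logit scale so fitted values stay inside the unit interval.
interior = y < 1
logit_y = np.log(y[interior] / (1 - y[interior]))
part2 = sm.QuantReg(logit_y, sm.add_constant(x[interior])).fit(q=0.5)
print(part2.params)
```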

16.
Multinomial logit (also termed multi-logit) models permit the analysis of the statistical relation between a categorical response variable and a set of explanatory variables (called covariates or regressors). Although multinomial logit is widely used in both the social and economic sciences, the interpretation of regression coefficients may be tricky, as the effect of covariates on the probability distribution of the response variable is nonconstant and difficult to quantify. The ternary plots illustrated in this article aim at facilitating the interpretation of regression coefficients and permit the effect of covariates (considered singly or jointly) on the probability distribution of the dependent variable to be quantified. Ternary plots can be drawn for both ordered and unordered categorical dependent variables when the number of possible outcomes equals three (trinomial response variable); these plots make it possible not only to represent the covariate effects over the whole parameter space of the dependent variable but also to compare the covariate effects for any given individual profile. The method is illustrated and discussed through the analysis of a dataset concerning the transition of master's graduates of the University of Trento (Italy) from university to employment.
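A small sketch of how predicted trinomial probabilities map to ternary (barycentric) coordinates for plotting, on simulated data; scikit-learn's default lbfgs solver fits the multinomial (softmax) logit, and the plotting layer itself is left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 300
X = rng.normal(size=(n, 2))
coefs = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.1]])   # one row per outcome
labels = np.argmax(X @ coefs.T + rng.gumbel(size=(n, 3)), axis=1)

mlogit = LogisticRegression(max_iter=1000).fit(X, labels)
probs = mlogit.predict_proba(X)                   # rows are (p1, p2, p3)

# Barycentric-to-Cartesian mapping for an equilateral ternary plot:
# vertex 1 at (0, 0), vertex 2 at (1, 0), vertex 3 at (1/2, sqrt(3)/2).
tern_x = probs[:, 1] + 0.5 * probs[:, 2]
tern_y = (np.sqrt(3) / 2) * probs[:, 2]
print(np.column_stack([tern_x, tern_y])[:5])      # scatter these to get the plot
```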

17.
We propose a random partition model that implements prediction with many candidate covariates and interactions. The model is based on a modified product partition model that includes a regression on covariates by favouring homogeneous clusters in terms of these covariates. Additionally, the model allows for a cluster-specific choice of the covariates that are included in this evaluation of homogeneity. The variable selection is implemented by introducing a set of cluster-specific latent indicators that include or exclude covariates. The proposed model is motivated by an application to predicting mortality in an intensive care unit in Lisboa, Portugal.

18.
For some investments, the relation between stock returns and the market proxy is conventionally described by a linear regression model under a normality assumption. This paper derives the distribution of stock returns for a security in an upgrade (or downgrade) market, under the assumption that the log stock returns of the market proxy follow a mixture of normal distributions. We discuss maximum likelihood and method-of-moments estimation of the parameters involved in the model. An analysis of stock data from the Johannesburg Stock Exchange is included to illustrate the model. This note explains the phenomenon, observed in financial analysis, concerning the shape of the distribution of long-run stock returns conditional on an upgrade or downgrade of the market index.
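For the distributional assumption at the core of the paper, here is a minimal EM routine for maximum likelihood in a two-component normal mixture; the data are simulated returns, not Johannesburg Stock Exchange data, and the regime parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(12)
r = np.concatenate([rng.normal(0.001, 0.01, 700),    # calm regime
                    rng.normal(-0.002, 0.03, 300)])  # volatile regime

pi, mu, sd = 0.5, np.array([0.0, 0.01]), np.array([0.01, 0.02])
for _ in range(200):
    # E-step: responsibility of component 0 for each return.
    d0 = pi * norm.pdf(r, mu[0], sd[0])
    d1 = (1 - pi) * norm.pdf(r, mu[1], sd[1])
    g = d0 / (d0 + d1)
    # M-step: weighted means, standard deviations, and mixing proportion.
    for k, w in enumerate([g, 1 - g]):
        mu[k] = np.average(r, weights=w)
        sd[k] = np.sqrt(np.average((r - mu[k]) ** 2, weights=w))
    pi = g.mean()
print(f"pi={pi:.2f}, mu={mu.round(4)}, sd={sd.round(4)}")
```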

19.
We propose a general latent variable model for multivariate ordinal categorical variables, in which both the responses and the covariates are ordinal, to assess the effect of the covariates on the responses and to model the covariance structure of the response variables. A fully Bayesian approach is employed to analyze the model. The Gibbs sampler is used to simulate the joint posterior distribution of the latent variables and the parameters, and parameter expansion and reparameterization techniques are used to speed up convergence. The proposed model and method are demonstrated by simulation studies and a real data example.
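A hedged miniature of the data-augmentation Gibbs idea (in the spirit of Albert-Chib) for a binary probit model with one covariate and a flat prior; the paper's sampler extends this to multivariate ordinal responses with cutpoints, a covariance structure, and parameter expansion. Data are simulated.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(13)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-0.3, 1.2])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

beta = np.zeros(2)
draws = []
V = np.linalg.inv(X.T @ X)            # posterior covariance under a flat prior
for it in range(2000):
    # 1) Latent utilities z_i | y_i, beta: N(x_i'beta, 1) truncated to the
    #    half-line that matches the observed y_i.
    m = X @ beta
    lo = np.where(y == 1, -m, -np.inf)
    hi = np.where(y == 1, np.inf, -m)
    z = m + truncnorm.rvs(lo, hi, size=n, random_state=rng)
    # 2) beta | z: ordinary Bayesian linear-regression update.
    beta = rng.multivariate_normal(V @ X.T @ z, V)
    if it >= 500:                      # discard burn-in
        draws.append(beta)
print("posterior mean:", np.mean(draws, axis=0))     # close to beta_true
```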

20.
We present a class of truncated nonlinear regression models for location and scale, where the truncated nature of the data is incorporated into the statistical model by assuming that the response variable follows a truncated distribution. The location parameter of the response variable is modeled by a continuous nonlinear function of covariates and unknown parameters, and the model also allows the scale parameter of the responses to be characterized by a continuous function of the covariates and unknown parameters. Three particular cases of the proposed models are presented by taking the response variable to follow a truncated normal, truncated skew-normal, or truncated beta distribution. These truncated nonlinear regression models are constructed assuming fixed, known truncation limits, and the model parameters are estimated by direct maximization of the log-likelihood using a nonlinear optimization algorithm. Standardized residuals and diagnostic metrics based on case deletion are considered to verify the adequacy of the model and to detect outliers and influential observations. Results based on simulated data are presented to assess the frequentist properties of the estimates, and a real data set on soil-water retention from the Buriti Vermelho River Basin database is analyzed using the proposed methodology.
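A compact sketch of maximum likelihood for the truncated-normal case with fixed, known truncation limits, using direct optimization of the log-likelihood; the exponential-growth location curve and all parameter values are illustrative, and the scale is held constant rather than covariate-dependent.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import truncnorm

rng = np.random.default_rng(14)
a, b = 0.0, 10.0                       # known truncation limits
n = 400
x = rng.uniform(0, 2, size=n)

def mu(x, th):                         # nonlinear location: th0 * exp(th1 * x)
    return th[0] * np.exp(th[1] * x)

# Simulate truncated responses by rejection from the untruncated model.
y = np.empty(n)
for i in range(n):
    while True:
        yi = mu(x[i], (2.0, 0.8)) + rng.normal(scale=1.0)
        if a < yi < b:
            y[i] = yi
            break

def negloglik(params):
    th, log_s = params[:2], params[2]
    s = np.exp(log_s)                  # keep the scale positive
    m = mu(x, th)
    # truncnorm takes standardized truncation bounds (a - m)/s and (b - m)/s.
    return -np.sum(truncnorm.logpdf(y, (a - m) / s, (b - m) / s, loc=m, scale=s))

fit = minimize(negloglik, x0=np.array([1.0, 0.5, 0.0]), method="Nelder-Mead")
print("theta, sigma:", fit.x[:2], np.exp(fit.x[2]))
```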
