20 similar documents found (search time: 31 ms)
We discuss the impact of tuning parameter selection uncertainty in the context of shrinkage estimation and propose a methodology to account for problems arising from this issue: transferring established concepts from model averaging to shrinkage estimation yields the concept of shrinkage averaging estimation (SAE), which reflects the idea of using weighted combinations of shrinkage estimators with different tuning parameters to improve the overall stability, predictive performance and standard errors of shrinkage estimators. Two distinct approaches to an appropriate weight choice, both inspired by concepts from the recent literature on model averaging, are presented. The first approach relates to an optimal weight choice with regard to the predictive performance of the final weighted estimator, and its implementation can be realized via quadratic programming. The second approach has a fairly different motivation and considers the construction of weights via a resampling experiment. Focusing on Ridge, Lasso and Random Lasso estimators, the properties of the proposed shrinkage averaging estimators resulting from these strategies are explored by means of Monte Carlo studies and are compared to traditional approaches where the tuning parameter is simply selected via cross-validation criteria. The results show that the proposed SAE methodology can improve an estimator's overall performance and can reveal and incorporate tuning parameter uncertainty. As an illustration, selected methods are applied to some recent data from a study on leadership behavior in life science companies.
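
The idea of choosing weights for predictive performance can be illustrated with a toy case. The following is a minimal sketch, not the authors' implementation: it assumes a single predictor, a ridge estimator without intercept, only two candidate tuning parameters, and leave-one-out cross-validation as the measure of predictive performance. With two candidates the quadratic program reduces to a one-dimensional quadratic in the weight w, whose constrained minimiser is the unconstrained one clipped to [0, 1]. All function names and data values are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate for one predictor, no intercept: sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def loo_predictions(xs, ys, lam):
    """Leave-one-out predictions: refit without observation i, then predict it."""
    preds = []
    for i in range(len(xs)):
        x_train = xs[:i] + xs[i + 1:]
        y_train = ys[:i] + ys[i + 1:]
        preds.append(ridge_slope(x_train, y_train, lam) * xs[i])
    return preds


def sae_weight(ys, p1, p2):
    """Weight w in [0, 1] on p1 minimising sum((y - w*p1 - (1-w)*p2)^2)."""
    num = sum((y - b) * (a - b) for y, a, b in zip(ys, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0.0:
        return 0.5  # both candidates predict identically; any weight is optimal
    return min(1.0, max(0.0, num / den))  # clip the 1-D quadratic's minimiser


def cv_error(ys, preds):
    """Sum of squared leave-one-out prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


# Hypothetical toy data and two hypothetical tuning parameters.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.4, 1.3, 1.2, 2.4, 2.1, 3.3]
lam_small, lam_large = 0.1, 5.0

p_small = loo_predictions(xs, ys, lam_small)
p_large = loo_predictions(xs, ys, lam_large)
w = sae_weight(ys, p_small, p_large)

# Averaged predictions and the averaged (weighted) ridge estimate.
p_avg = [w * a + (1.0 - w) * b for a, b in zip(p_small, p_large)]
beta_small = ridge_slope(xs, ys, lam_small)
beta_large = ridge_slope(xs, ys, lam_large)
beta_sae = w * beta_small + (1.0 - w) * beta_large
```

Because w = 0 and w = 1 are both feasible, the cross-validation error of the averaged predictions in this sketch can never exceed that of the better single tuning parameter. With more than two candidates the same criterion becomes a quadratic program over the weight simplex and would need a proper solver.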

Variable and model selection problems are fundamental to high-dimensional statistical modeling in diverse fields of science. Especially in health studies, many potential factors are usually introduced to determine an outcome variable. This paper deals with the problem of high-dimensional statistical modeling through the analysis of the annual trauma data for Greece in 2005. The data set is divided into experiment and control sets and consists of 6334 observations and 112 factors, including demographic, transport and intrahospital data, used to detect possible risk factors of death. In our study, different model selection techniques are applied to the experiment set, and the notion of deviance is used on the control set to assess the fit of the overall selected model. The statistical methods employed in this work were the non-concave penalized likelihood methods (smoothly clipped absolute deviation, least absolute shrinkage and selection operator, and Hard), generalized linear logistic regression, and best subset variable selection. The way of identifying the significant variables in large medical data sets, along with the performance and the pros and cons of the various statistical techniques used, is discussed. The analysis reveals the distinct advantages of the non-concave penalized likelihood methods over the traditional model selection techniques.

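In wage gap decomposition problems, researchers often encounter sample selection bias, and simply ignoring it leads to seriously biased final estimates. At the same time, among the many wage gap decomposition methods, researchers favour distributional decomposition over mean decomposition. For parametric quantile regression, this paper proposes for the first time additive and non-additive forms of the sample selection parametric quantile regression (SSPQR) model and, based on these two classes of models, gives a parametric quantile regression method for the distributional decomposition of wage gaps that corrects for sample selection bias. Using these methods together with existing wage distribution decomposition methods, and drawing on the CHNS 2015 urban data, the paper studies the urban male-female wage gap in China and its decomposition, and reaches the following conclusions: (1) the main source of the male-female wage gap is gender discrimination; (2) after correcting for sample selection bias, the actual wage gap is larger and the discrimination more severe; (3) the size of the male-female wage gap differs across quantiles, in other words, the gap cannot be judged simply from the average level; (4) compared with the results of other existing methods, SSPQR yields a larger wage gap.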

To be useful to clinicians, prognostic and diagnostic indices must be derived from accurate models developed by using appropriate data sets. We show that fractional polynomials, which extend ordinary polynomials by including non-positive and fractional powers, may be used as the basis of such models. We describe how to fit fractional polynomials in several continuous covariates simultaneously, and we propose ways of ensuring that the resulting models are parsimonious and consistent with basic medical knowledge. The methods are applied to two breast cancer data sets, one from a prognostic factors study in patients with positive lymph nodes and the other from a study to diagnose malignant or benign tumours by using colour Doppler blood flow mapping. We investigate the problems of biased parameter estimates in the final model and overfitting, using cross-validation calibration to estimate shrinkage factors. We adopt bootstrap resampling to assess model stability. We compare our new approach with conventional modelling methods which apply stepwise variable selection to categorized covariates. We conclude that fractional polynomial methodology can be very successful in generating simple and appropriate models.
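
To make the notion of a fractional polynomial concrete, here is a minimal sketch of selecting a single power for one covariate by least squares. It is an illustration under stated assumptions, not the procedure of the paper: the candidate power set below is the conventional one and is not given in the abstract, power 0 is read as log x, and the multi-covariate fitting, shrinkage and bootstrap steps are omitted. All names and data are hypothetical.

```python
import math

# Conventional candidate powers (an assumption here); 0 denotes log(x).
FP_POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]


def fp_transform(x, power):
    """Fractional polynomial transform of a positive value x."""
    return math.log(x) if power == 0 else x ** power


def rss_simple_regression(zs, ys):
    """Residual sum of squares from the least-squares fit of y on (1, z)."""
    n = len(zs)
    z_bar = sum(zs) / n
    y_bar = sum(ys) / n
    szz = sum((z - z_bar) ** 2 for z in zs)
    szy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
    slope = szy / szz
    intercept = y_bar - slope * z_bar
    return sum((y - intercept - slope * z) ** 2 for z, y in zip(zs, ys))


def best_fp1_power(xs, ys, powers=FP_POWERS):
    """Return the candidate power whose transformed fit has the smallest RSS."""
    return min(
        powers,
        key=lambda p: rss_simple_regression([fp_transform(x, p) for x in xs], ys),
    )


# Hypothetical noise-free data generated from a logarithmic relationship.
xs = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0]
ys = [2.0 + 3.0 * math.log(x) for x in xs]
best = best_fp1_power(xs, ys)
```

On these noise-free data the logarithmic transform fits exactly, so the search picks power 0. With real data the powers would be compared by deviance rather than raw residual sum of squares, and the covariate must be positive for the transforms to be defined.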

This paper surveys various shrinkage, smoothing and selection priors from a unifying perspective and shows how to combine them for Bayesian regularisation in the general class of structured additive regression models. As a common feature, all regularisation priors are conditionally Gaussian, given further parameters regularising model complexity. Hyperpriors for these parameters encourage shrinkage, smoothness or selection. It is shown that these regularisation (log-) priors can be interpreted as Bayesian analogues of several well-known frequentist penalty terms. Inference can be carried out with unified and computationally efficient MCMC schemes, estimating regularised regression coefficients and basis function coefficients simultaneously with complexity parameters and measuring uncertainty via corresponding marginal posteriors. For variable and function selection we discuss several variants of spike and slab priors which can also be cast into the framework of conditionally Gaussian priors. The performance of the Bayesian regularisation approaches is demonstrated in a hazard regression model and a high-dimensional geoadditive regression model.

We consider multiple comparisons of log-likelihoods to take account of the multiplicity of tests in the selection of nonnested models. A resampling version of the Gupta procedure for the selection problem is used to obtain a set of good models, which are not significantly worse than the maximum likelihood model; i.e., a confidence set of models. Our method is to test which model is better than the other, while the object of the classical testing methods is to find the correct model. Thus the null hypotheses behind these two approaches are very different. Our method and the other commonly used approaches, such as the approximate Bayesian posterior, the bootstrap selection probability, and the LR test against the full model, are applied to the selection of the molecular phylogenetic tree of mammal species. Tree selection is a version of model-based clustering, which is an example of nonnested model selection. It is shown that the structure of the tree selection problem is equivalent to that of the variable selection problem of multiple regression with some constraints on the combinations of the variables. It turns out that the LR test rejects all the possible trees because of the misspecification of the models, whereas our method gives a reasonable confidence set. For a better understanding of the uncertainty in the selection, we combine the maximum likelihood estimates (MLEs) of the trees to obtain the full model that includes the trees as submodels by using a linear approximation of the parametric models. The MLE of the phylogeny is then represented as a network of species rather than a tree. A geometrical interpretation of the problem is also discussed.

The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on variable subset selection for regression and classification and perform several numerical experiments using both simulated and real-world data. The results show that the optimization of a utility estimate such as the cross-validation (CV) score is liable to find overfitted models because of the relatively high variance in the utility estimates when the data are scarce. This can also lead to substantial selection-induced bias and optimism in the performance evaluation for the selected model. From a predictive viewpoint, the best results are obtained by accounting for model uncertainty by forming the full encompassing model, such as the Bayesian model averaging solution over the candidate models. If the encompassing model is too complex, it can be robustly simplified by the projection method, in which the information of the full model is projected onto the submodels. This approach is substantially less prone to overfitting than selection based on the CV score. Overall, the projection method also appears to outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the searching process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.

In high-dimensional data settings, sparse model fits are desired, which can be obtained through shrinkage or boosting techniques. We investigate classical shrinkage techniques such as the lasso, which is theoretically known to be biased; new techniques that address this problem, such as the elastic net and SCAD; and the boosting technique CoxBoost and extensions of it, which allow additional structure to be incorporated. To examine whether these methods, which are designed for or frequently used in high-dimensional survival data analysis, provide sensible results in low-dimensional data settings as well, we consider the well-known GBSG breast cancer data. In detail, we study the bias, stability and sparseness of these model fitting techniques via comparison to the maximum likelihood estimate and resampling, and their prediction performance via prediction error curve estimates.

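Wang Xiaoyan (王小燕) et al., Statistical Research (《统计研究》), 2014, 31(9): 107-112
Variable selection is an important part of statistical modelling: choosing appropriate variables yields robust models with a simple structure and accurate predictions. Under logistic regression, this paper proposes a new bi-level variable selection penalty method, the adaptive Sparse Group Lasso (adSGL). Its distinctive feature is that it screens variables on the basis of their group structure, achieving bi-level selection both within and between groups. The advantage of the method is that it penalizes individual coefficients and group coefficients to different degrees, which avoids over-penalizing large coefficients and thereby improves the estimation and prediction accuracy of the model. The difficulty in solving it is that the penalized likelihood function is not strictly convex, so the paper solves the model by group coordinate descent and establishes a criterion for selecting the tuning parameters. Simulation analysis shows that, compared with the existing representative methods Sparse Group Lasso, Group Lasso and Lasso, adSGL not only improves bi-level selection accuracy but also reduces model error. Finally, the paper applies adSGL to credit card credit scoring, where it achieves higher classification accuracy and robustness than logistic regression.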

Nonparametric seemingly unrelated regression provides a powerful alternative to parametric seemingly unrelated regression for relaxing the linearity assumption. The existing methods are limited, particularly when there are sharp changes in the relationship between the predictor variables and the corresponding response variable. We propose a new nonparametric method for seemingly unrelated regression, which adopts a tree-structured regression framework, has satisfactory prediction accuracy and interpretability, places no restriction on the inclusion of categorical variables, and is less vulnerable to the curse of dimensionality. Moreover, an important feature is that it constructs a unified tree-structured model for multivariate data, even when the predictor variables corresponding to each response variable are entirely different. This unified model can offer revelatory insights such as underlying economic meaning. We propose the key components of tree-structured regression: an impurity function that detects complex nonlinear relationships between the predictor variables and the response variable, split rule selection with negligible selection bias, and tree size determination that addresses underfitting and overfitting. We demonstrate our proposed method using simulated data and illustrate it using data from the Korea stock exchange sector indices.

Hea-Jung Kim, Taeyoung Roh, Statistics, 2013, 47(5): 1082-1111
In regression analysis, a sample selection scheme often applies to the response variable, which results in observations on the variable that are missing not at random. In this case, a regression analysis using only the selected cases would lead to biased results. This paper proposes a Bayesian methodology to correct this bias based on a semiparametric Bernstein polynomial regression model that incorporates the sample selection scheme into a stochastic monotone trend constraint, variable selection, and robustness against departures from the normality assumption. We present the basic theoretical properties of the proposed model, which include its stochastic representation, sample selection bias quantification, and hierarchical model specification to deal with the stochastic monotone trend constraint in the nonparametric component, simple bias-corrected estimation, and variable selection for the linear components. We then develop computationally feasible Markov chain Monte Carlo methods for semiparametric Bernstein polynomial functions with stochastically constrained parameter estimation and variable selection procedures. We demonstrate the finite-sample performance of the proposed model compared to existing methods using simulation studies and illustrate its use based on two real data applications.

We propose a robust regression method called regression with outlier shrinkage (ROS) for the traditional n > p cases. It improves over other robust regression methods such as least trimmed squares (LTS) in the sense that it can achieve maximum breakdown value and full asymptotic efficiency simultaneously. Moreover, its computational complexity is no more than that of LTS. We also propose a sparse estimator, called sparse regression with outlier shrinkage (SROS), for robust variable selection and estimation. It is proven that SROS can not only give consistent selection but also estimate the nonzero coefficients with full asymptotic efficiency under the normal model. In addition, we introduce the concept of a nearly regression equivariant estimator for understanding the breakdown properties of sparse estimators, and prove that SROS achieves the maximum breakdown value of nearly regression equivariant estimators. Numerical examples are presented to illustrate our methods.

Summary. Problems of the analysis of data with incomplete observations are all too familiar in statistics. They are doubly difficult if we are also uncertain about the choice of model. We propose a general formulation for the discussion of such problems and develop approximations to the resulting bias of maximum likelihood estimates on the assumption that model departures are small. Loss of efficiency in parameter estimation due to incompleteness in the data has a dual interpretation: the increase in variance when an assumed model is correct; the bias in estimation when the model is incorrect. Examples include non-ignorable missing data, hidden confounders in observational studies and publication bias in meta-analysis. Doubling variances before calculating confidence intervals or test statistics is suggested as a crude way of addressing the possibility of undetectably small departures from the model. The problem of assessing the risk of lung cancer from passive smoking is used as a motivating example.
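
The practical effect of doubling a variance can be seen in a short calculation. The sketch below assumes a standard Wald-type 95% interval (estimate plus or minus z times the standard error), which is an assumption of this illustration and not something the abstract specifies. The estimate and variance are hypothetical numbers.

```python
import math

Z_975 = 1.959963984540054  # 97.5% quantile of the standard normal distribution


def wald_interval(estimate, variance, z=Z_975):
    """Two-sided Wald interval: estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return (estimate - half_width, estimate + half_width)


# Hypothetical estimate and variance, for illustration only.
estimate, variance = 0.25, 0.01

naive = wald_interval(estimate, variance)
cautious = wald_interval(estimate, 2.0 * variance)  # variance doubled

naive_width = naive[1] - naive[0]
cautious_width = cautious[1] - cautious[0]
```

Doubling the variance widens the interval by a factor of the square root of 2, about 41%, and divides a Wald test statistic by the same factor.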

When variable selection with stepwise regression and model fitting are conducted on the same data set, competition for inclusion in the model induces a selection bias in coefficient estimators away from zero. In proportional hazards regression with right-censored data, selection bias inflates the absolute values of the estimates of selected parameters, while the omission of other variables may shrink coefficients toward zero. This paper explores the extent of the bias in parameter estimates from stepwise proportional hazards regression and proposes a bootstrap method, similar to those proposed by Miller (Subset Selection in Regression, 2nd edn. Chapman & Hall/CRC, 2002) for linear regression, to correct for selection bias. We also use bootstrap methods to estimate the standard error of the adjusted estimators. Simulation results show that substantial biases could be present in uncorrected stepwise estimators and, for binary covariates, could exceed 250% of the true parameter value. The simulations also show that the conditional mean of the proposed bootstrap bias-corrected parameter estimator, given that a variable is selected, is moved closer to the unconditional mean of the standard partial likelihood estimator in the chosen model, and to the population value of the parameter. We also explore the effect of the adjustment on estimates of log relative risk, given the values of the covariates in a selected model. The proposed method is illustrated with data sets in primary biliary cirrhosis and in multiple myeloma from the Eastern Cooperative Oncology Group.

Summary. Contemporary statistical research frequently deals with problems involving a diverging number of parameters. For those problems, various shrinkage methods (e.g. the lasso and smoothly clipped absolute deviation) are found to be particularly useful for variable selection. Nevertheless, the desirable performance of those shrinkage methods hinges heavily on an appropriate selection of the tuning parameters. With a fixed predictor dimension, Wang and co-workers have demonstrated that the tuning parameters selected by a Bayesian information criterion type criterion can identify the true model consistently. In this work, similar results are further extended to the situation with a diverging number of parameters for both unpenalized and penalized estimators. Consequently, our theoretical results further enlarge not only the scope of applicability of Bayesian information criterion type criteria but also that of those shrinkage estimation methods.

The major problem of mean–variance portfolio optimization is parameter uncertainty. Many methods have been proposed to tackle this problem, including shrinkage methods, resampling techniques, and imposing constraints on the portfolio weights. This paper suggests a new estimation method for mean–variance portfolio weights based on the concept of the generalized pivotal quantity (GPQ) in the case when asset returns are multivariate normally distributed and serially independent. Both point and interval estimation of the portfolio weights are considered. Compared with Markowitz's mean–variance model and with resampling and shrinkage methods, the proposed GPQ method typically yields the smallest mean-squared error for the point estimate of the portfolio weights and obtains a satisfactory coverage rate for their simultaneous confidence intervals. Finally, we apply the proposed methodology to address a portfolio rebalancing problem.

Often in observational studies of time to an event, the study population is a biased (i.e., unrepresentative) sample of the target population. In the presence of biased samples, it is common to weight subjects by the inverse of their respective selection probabilities. Pan and Schaubel (Can J Stat 36:111–127, 2008) recently proposed inference procedures for an inverse selection probability weighted (ISPW) Cox model, applicable when selection probabilities are not treated as fixed but estimated empirically. The proposed weighting procedure requires auxiliary data to estimate the weights and is computationally more intensive than unweighted estimation. The ignorability of the sample selection process in terms of parameter estimators and predictions is often of interest from several perspectives: e.g., to determine whether weighting makes a significant difference to the analysis at hand, which would in turn address whether the collection of auxiliary data is required in future studies; or to evaluate previous studies which did not correct for selection bias. In this article, we propose methods to quantify the degree of bias corrected by the weighting procedure in the partial likelihood and Breslow-Aalen estimators. Asymptotic properties of the proposed test statistics are derived. The finite-sample significance level and power are evaluated through simulation. The proposed methods are then applied to data from a national organ failure registry to evaluate the bias in a post-kidney transplant survival model.
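
The weighting idea can be shown on a much simpler target than the Cox model. The sketch below is only an illustration of inverse selection probability weighting applied to a sample mean. It assumes the selection probabilities are known, whereas the abstract concerns the case where they are estimated, and it does not implement the ISPW Cox model or the proposed test statistics. The data are hypothetical.

```python
def unweighted_mean(values):
    """Plain sample mean, ignoring how subjects were selected."""
    return sum(values) / len(values)


def ispw_mean(values, selection_probs):
    """Mean with each subject weighted by 1 / (its selection probability)."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical biased sample: four subjects from an over-sampled group
# (selection probability 0.8, outcome 1.0) and one subject from an
# under-sampled group (selection probability 0.2, outcome 6.0).
values = [1.0, 1.0, 1.0, 1.0, 6.0]
selection_probs = [0.8, 0.8, 0.8, 0.8, 0.2]

naive = unweighted_mean(values)
weighted = ispw_mean(values, selection_probs)
bias_corrected = weighted - naive  # shift attributable to the weighting
```

Each over-sampled subject stands in for 1.25 members of the target population and the under-sampled subject for 5, so the weighted mean moves from 2.0 to 3.5. The difference between the two estimates is a crude analogue of the degree of bias corrected by weighting that the article sets out to quantify.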

In the regression model with censored data, it is not straightforward to estimate the covariances of the regression estimators, since their asymptotic covariances may involve the unknown error density function and its derivative. In this article, a resampling method for making inferences on the parameter, based on some estimating functions, is discussed for the censored regression model. The inference procedures are associated with a weight function. To find the best weight functions for the proposed procedures, extensive simulations are performed. The validity of the approximation to the distribution of the estimator by a resampling technique is also examined visually. Implementation of the procedures is discussed and illustrated in a real data example.


Regression spline smoothing is a popular approach for conducting nonparametric regression. An important issue associated with it is the choice of a "theoretically best" set of knots. Different statistical model selection methods, such as Akaike's information criterion and generalized cross-validation, have been applied to derive different "theoretically best" sets of knots. Typically these best knot sets are defined implicitly as the optimizers of some objective functions. Hence another equally important issue concerning regression spline smoothing is how to optimize such objective functions. In this article different numerical algorithms that are designed for carrying out such optimization problems are compared by means of a simulation study. Both the univariate and bivariate smoothing settings will be considered. Based on the simulation results, recommendations for choosing a suitable optimization algorithm under various settings will be provided.

Penalised likelihood methods, such as the least absolute shrinkage and selection operator (Lasso) and the smoothly clipped absolute deviation penalty, have become widely used for variable selection in recent years. These methods impose penalties on regression coefficients to shrink a subset of them towards zero to achieve parameter estimation and model selection simultaneously. The amount of shrinkage is controlled by the regularisation parameter. Popular approaches for choosing the regularisation parameter include cross-validation, various information criteria and bootstrapping methods that are based on mean square error. In this paper, a new data-driven method for choosing the regularisation parameter is proposed and the consistency of the method is established. It holds not only for the usual fixed-dimensional case but also for the divergent setting. Simulation results show that the new method outperforms other popular approaches. An application of the proposed method to motif discovery in gene expression analysis is included in this paper.
