Similar literature
20 similar articles found
1.
When employing model selection methods with oracle properties such as the smoothly clipped absolute deviation (SCAD) and the Adaptive Lasso, it is typical to estimate the smoothing parameter by m-fold cross-validation, for example, m = 10. In problems where the true regression function is sparse and the signals large, such cross-validation typically works well. However, in regression modeling of genomic studies involving single nucleotide polymorphisms (SNPs), the true regression functions, while thought to be sparse, do not have large signals. We demonstrate empirically that in such problems, the number of selected variables using SCAD and the Adaptive Lasso, with 10-fold cross-validation, is a random variable that has considerable and surprising variation. Similar remarks apply to non-oracle methods such as the Lasso. Our study strongly questions the suitability of performing only a single run of m-fold cross-validation with any oracle method, and not just the SCAD and Adaptive Lasso.
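
The run-to-run variation described in this abstract can be illustrated with a deliberately simplified sketch. It is not the authors' experiment: it assumes an orthonormal "sequence" model in which the Lasso reduces to soft thresholding of column means, and every name and value below (`soft_threshold`, `cv_selected_count`, the signal size 0.3, the lambda grid) is hypothetical and chosen only for illustration.

```python
import random


def soft_threshold(z, lam):
    """Lasso solution for one coefficient in the orthonormal (sequence) model."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def column_means(rows, p):
    """Mean of each of the p columns over the given rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(p)]


def cv_selected_count(data, lambdas, m, rng):
    """One run of m-fold CV: pick lambda, refit on all data, count nonzeros."""
    n, p = len(data), len(data[0])
    idx = list(range(n))
    rng.shuffle(idx)                      # the random fold assignment
    folds = [idx[k::m] for k in range(m)]
    cv_err = [0.0] * len(lambdas)
    for fold in folds:
        held = set(fold)
        train_mean = column_means([data[i] for i in range(n) if i not in held], p)
        for k, lam in enumerate(lambdas):
            est = [soft_threshold(z, lam) for z in train_mean]
            cv_err[k] += sum((data[i][j] - est[j]) ** 2
                             for i in fold for j in range(p))
    best_lam = lambdas[min(range(len(lambdas)), key=cv_err.__getitem__)]
    full_mean = column_means(data, p)
    return sum(1 for z in full_mean if soft_threshold(z, best_lam) != 0.0)


rng = random.Random(0)
n, p = 50, 40
beta = [0.3] * 5 + [0.0] * (p - 5)        # sparse, weak signals (assumed values)
data = [[b + rng.gauss(0.0, 1.0) for b in beta] for _ in range(n)]
lambdas = [0.02 * k for k in range(26)]   # hypothetical tuning grid

# Same data set every time; only the fold assignment changes between runs.
counts = [cv_selected_count(data, lambdas, m=10, rng=rng) for _ in range(20)]
print(counts)
```

Printing `counts` shows how many variables each of the 20 repeated 10-fold runs selects on one fixed data set, so any spread in the list comes from the fold assignment alone.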

2.
We propose two new procedures based on multiple hypothesis testing for correct support estimation in high-dimensional sparse linear models. We prove that both procedures are powerful and do not require the sample size to be large. The first procedure tackles the atypical setting of ordered variable selection through an extension of a testing procedure previously developed in the context of a linear hypothesis. The second procedure is the main contribution of this paper. It enables data analysts to perform support estimation in the general high-dimensional framework of non-ordered variable selection. A thorough simulation study and applications to real datasets using the R package mht show that our non-ordered variable procedure produces excellent results in terms of correct support estimation as well as in terms of mean square errors and false discovery rate, when compared to common methods such as the Lasso, the SCAD penalty, forward regression or the false discovery rate procedure (FDR).

3.
Abstract. This paper considers covariate selection for the additive hazards model. This model is particularly simple to study theoretically and its practical implementation has several major advantages over the similar methodology for the proportional hazards model. One complication compared with the proportional model is, however, that there is no simple likelihood to work with. We here study a least squares criterion with desirable properties and show how this criterion can be interpreted as a prediction error. Given this criterion, we define ridge and Lasso estimators as well as an adaptive Lasso and study their large sample properties for the situation where the number of covariates p is smaller than the number of observations. We also show that the adaptive Lasso has the oracle property. In many practical situations, it is more relevant to tackle the situation with large p compared with the number of observations. We do this by studying the properties of the so-called Dantzig selector in the setting of the additive risk model. Specifically, we establish a bound on how close the solution is to a true sparse signal in the case where the number of covariates is large. In a simulation study, we also compare the Dantzig and adaptive Lasso for a moderate to small number of covariates. The methods are applied to a breast cancer data set with gene expression recordings and to the primary biliary cirrhosis clinical data.

4.
A number of nonstationary models have been developed to estimate extreme events as a function of covariates. A quantile regression (QR) model is a statistical approach intended to estimate and conduct inference about the conditional quantile functions. In this article, we focus on simultaneous variable selection and parameter estimation through penalized quantile regression. We compare regularized quantile regression models with B-splines in a Bayesian framework. Regularization is based on a penalty and aims to favor parsimonious models, especially in high-dimensional spaces. The prior distributions related to the penalties are detailed. Five penalties (Lasso, Ridge, SCAD0, SCAD1 and SCAD2) are considered with their equivalent expressions in the Bayesian framework. The regularized quantile estimates are then compared to the maximum likelihood estimates with respect to the sample size. Markov chain Monte Carlo (MCMC) algorithms are developed for each hierarchical model to simulate the conditional posterior distribution of the quantiles. Results indicate that SCAD0 and the Lasso have the best performance for quantile estimation according to the Relative Mean Bias (RMB) and Relative Mean-Error (RME) criteria, especially in the case of heavy-tailed errors. A case study of the annual maximum precipitation at Charlo, Eastern Canada, with the Pacific North Atlantic climate index as covariate is presented.

5.
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Traditional statistical inference procedures based on standard regression methods often fail in the presence of high-dimensional features. In recent years, regularization methods have emerged as promising tools for analyzing high dimensional data. These methods simultaneously select important features and provide stable estimation of their effects. Adaptive LASSO and SCAD, for instance, give consistent and asymptotically normal estimates with oracle properties. However, in finite samples, it remains difficult to obtain interval estimators for the regression parameters. In this paper, we propose perturbation resampling based procedures to approximate the distribution of a general class of penalized parameter estimates. Our proposal, justified by asymptotic theory, provides a simple way to estimate the covariance matrix and confidence regions. Through finite sample simulations, we verify the ability of this method to give accurate inference and compare it to other widely used standard deviation and confidence interval estimates. We also illustrate our proposals with a data set used to study the association of HIV drug resistance and a large number of genetic mutations.

6.
Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a “no panacea” view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.

7.
We consider estimation in a high-dimensional linear model with strongly correlated variables. We propose to cluster the variables first and then do sparse estimation, such as the Lasso for cluster representatives or the group Lasso based on the structure from the clusters. Regarding the first step, we present a novel bottom-up agglomerative clustering algorithm based on canonical correlations, and we show that it finds an optimal solution and is statistically consistent. We also present some theoretical arguments that canonical correlation based clustering leads to a better-posed compatibility constant for the design matrix, which ensures identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss circumstances where using cluster representatives with the Lasso as the subsequent estimator leads to improved results for prediction and detection of variables. We complement the theoretical analysis with various empirical results.

8.
We consider the problem of variable selection and estimation in the linear regression model in situations where the number of parameters diverges with the sample size. We propose the adaptive Generalized Ridge-Lasso (AdaGril), which is an extension of the adaptive Elastic Net. AdaGril incorporates information redundancy among correlated variables for model selection and estimation. It combines the strengths of quadratic regularization and adaptively weighted Lasso shrinkage. In this article, we highlight the grouped selection property of the AdaCnet method (one type of AdaGril) in the equal correlation case. Under weak conditions, we establish the oracle property of AdaGril, which ensures optimal large-sample performance when the dimension is high. Consequently, it both handles the problem of collinearity in high dimension and enjoys the oracle property. Moreover, we show that the AdaGril estimator achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the “true” regression coefficient. This bound is obtained under a weak Restricted Eigenvalue (RE) condition similar to the one used for the Lasso. Simulation studies show that some particular cases of AdaGril outperform its competitors.

9.
Liang H, Liu X, Li R, Tsai CL. Annals of Statistics, 2010, 38(6): 3811-3836
In partially linear single-index models, we obtain the semiparametrically efficient profile least-squares estimators of regression coefficients. We also employ the smoothly clipped absolute deviation penalty (SCAD) approach to simultaneously select variables and estimate regression coefficients. We show that the resulting SCAD estimators are consistent and possess the oracle property. Subsequently, we demonstrate that a proposed tuning parameter selector, BIC, identifies the true model consistently. Finally, we develop a linear hypothesis test for the parametric coefficients and a goodness-of-fit test for the nonparametric component. Monte Carlo studies are also presented.

10.
In this paper we consider minimisation of U-statistics with the weighted Lasso penalty and investigate their asymptotic properties in model selection and estimation. We prove that the use of appropriate weights in the penalty leads to a procedure that behaves like the oracle that knows the true model in advance, i.e. it is model selection consistent and estimates nonzero parameters at the standard rate. For the unweighted Lasso penalty, we obtain sufficient and necessary conditions for model selection consistency of estimators. The results rely strongly on the convexity of the loss function, which is the main assumption of the paper. Our theorems can be applied to the ranking problem as well as to generalised regression models. Thus, using U-statistics we can study more complex models (better describing real problems) than the usually investigated linear or generalised linear models.

11.
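The rapid development of computers has made data acquisition and storage far more convenient, and many enterprises have accumulated large amounts of data. At the same time, the dimensionality of the data keeps growing and noise variables are increasingly numerous, so one of the key problems in modeling is to screen a small number of important variables out of a high-dimensional set. For proportion data whose response takes values in the interval (0,1), a regularized Beta regression is proposed, and the maximum likelihood estimators under three penalties (LASSO, SCAD and MCP) and their asymptotic properties are studied. Simulations show that MCP outperforms SCAD and LASSO, and that SCAD also outperforms LASSO as the sample size increases. Finally, the method is applied to a study of the factors that influence the dividend yield of Chinese listed companies.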
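
For reference, the three penalties compared in this abstract are usually written in the following standard forms. The abstract does not state its notation, so the symbols here ($\lambda$ for the tuning parameter, $a$ and $\gamma$ for the shape parameters, $\ell$ for the Beta log-likelihood) and the scaling of the penalized objective are assumptions, not the authors' definitions.

```latex
% Standard textbook forms; notation assumed, not taken from the paper.
\begin{align*}
p^{\mathrm{LASSO}}_{\lambda}(\theta) &= \lambda\lvert\theta\rvert, \\[4pt]
p^{\mathrm{SCAD}}_{\lambda}(\theta) &=
\begin{cases}
\lambda\lvert\theta\rvert, & \lvert\theta\rvert \le \lambda, \\[2pt]
\dfrac{2a\lambda\lvert\theta\rvert - \theta^{2} - \lambda^{2}}{2(a-1)},
  & \lambda < \lvert\theta\rvert \le a\lambda, \\[6pt]
\dfrac{(a+1)\lambda^{2}}{2}, & \lvert\theta\rvert > a\lambda,
\end{cases}
\qquad a > 2, \\[4pt]
p^{\mathrm{MCP}}_{\lambda}(\theta) &=
\begin{cases}
\lambda\lvert\theta\rvert - \dfrac{\theta^{2}}{2\gamma},
  & \lvert\theta\rvert \le \gamma\lambda, \\[6pt]
\dfrac{\gamma\lambda^{2}}{2}, & \lvert\theta\rvert > \gamma\lambda,
\end{cases}
\qquad \gamma > 1, \\[4pt]
\hat\beta &= \arg\max_{\beta}\;
  \Bigl\{ \ell(\beta) - n \sum_{j=1}^{p} p_{\lambda}(\beta_j) \Bigr\}.
\end{align*}
```

Both SCAD and MCP agree with the Lasso near zero and flatten to a constant for large $\lvert\theta\rvert$, which is why they shrink large coefficients less than the Lasso does.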

12.
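What is usually called the Granger causality test is in fact a test of linear causality and cannot detect nonlinear causality. Peguin and Terasvirta (1999) proposed a general extension based on the Taylor expansion, applied it to testing nonlinear causality, and used principal component extraction to address the multicollinearity involved. However, extracting principal components does not resolve multicollinearity very well. Lasso regression is currently one of the main methods for handling multicollinearity. Compared with other methods, it more readily yields sparse solutions and performs variable selection while estimating parameters, so it can be used to address the multicollinearity in the test and improve the test's efficiency. Simulation results for the testing procedure show that the test based on Lasso regression performs well.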

13.
The Lasso has sparked interest in the use of penalization of the log-likelihood for variable selection, as well as for shrinkage. We are particularly interested in the more-variables-than-observations case of characteristic importance for modern data. The Bayesian interpretation of the Lasso as the maximum a posteriori estimate of the regression coefficients, which have been given independent, double exponential prior distributions, is adopted. Generalizing this prior provides a family of hyper-Lasso penalty functions, which includes the quasi-Cauchy distribution of Johnstone and Silverman as a special case. The properties of this approach, including the oracle property, are explored, and an EM algorithm for inference in regression problems is described. The posterior is multi-modal, and we suggest a strategy of using a set of perfectly fitting random starting values to explore modes in different regions of the parameter space. Simulations show that our procedure provides significant improvements on a range of established procedures, and we provide an example from chemometrics.
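
The Bayesian reading of the Lasso mentioned in this abstract can be made explicit in a few lines. The sketch below assumes a Gaussian linear model with known noise variance $\sigma^{2}$ and a double exponential prior with rate $\tau$. These symbols are introduced here for illustration and are not the paper's notation.

```latex
% Assumed model: y | \beta ~ N(X\beta, \sigma^2 I), independent priors
% p(\beta_j) = (\tau/2) \exp(-\tau |\beta_j|); \sigma^2 and \tau treated as known.
\begin{align*}
-\log p(\beta \mid y)
  &= \frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert_2^{2}
     + \tau \sum_{j=1}^{p} \lvert\beta_j\rvert + \text{const}, \\
\hat\beta_{\mathrm{MAP}}
  &= \arg\min_{\beta}\; \tfrac{1}{2} \lVert y - X\beta \rVert_2^{2}
     + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert,
  \qquad \lambda = \sigma^{2}\tau .
\end{align*}
```

Multiplying the negative log-posterior by $\sigma^{2}$ gives the second line, so the posterior mode coincides with the Lasso estimate at penalty level $\lambda = \sigma^{2}\tau$. Replacing the double exponential prior with a heavier-tailed one changes only the penalty term.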

14.
Abstract

In this paper, we propose a variable selection method for quantile regression models in ultra-high dimensional longitudinal data, called the weighted adaptive robust lasso (WAR-Lasso), which is doubly robust. We derive the consistency and the model selection oracle property of WAR-Lasso. Simulation studies show the double robustness of WAR-Lasso under both heavy-tailed error distributions and heavy contamination of the covariates. WAR-Lasso outperforms other methods such as SCAD. A real data analysis is carried out. It shows that WAR-Lasso tends to select fewer variables and that the estimated coefficients are economically meaningful.

15.
It is well known that M-estimation is a widely used method for robust statistical inference and that varying coefficient models have been widely applied in many scientific areas. In this paper, we consider M-estimation and model identification of bivariate varying coefficient models for longitudinal data. We make use of bivariate tensor-product B-splines as an approximation of the function and consider M-type regression splines obtained by minimizing a convex objective function. Mean and median regressions are included in this class. Moreover, with a double smoothly clipped absolute deviation (SCAD) penalization, we study the problem of simultaneous structure identification and estimation. Under appropriate conditions, we show that the proposed procedure possesses the oracle property in the sense that it is as efficient as the estimator when the true model is known prior to statistical analysis. Simulation studies are carried out to demonstrate the methodological power of the proposed methods with finite samples. The proposed methodology is illustrated with an analysis of a real data example.

16.
In this article, we present a new efficient iterative estimation approach based on local modal regression for single-index varying-coefficient models. The resulting estimators are shown to be robust to outliers and to the error distribution. The asymptotic properties of the estimators are established under some regularity conditions, and a practical modified EM algorithm is proposed for the new method. Moreover, to achieve a sparse estimator when there exist irrelevant variables in the index parameters, a variable selection procedure based on the SCAD penalty is developed to select significant parametric covariates, and the well-known oracle properties are also derived. Finally, some numerical examples with various error distributions and a real data analysis are conducted to illustrate the validity and feasibility of our proposed method.

17.
Penalization has been extensively adopted for variable selection in regression. In some applications, covariates have natural grouping structures, where those in the same group have correlated measurements or related functions. Under such settings, variable selection should be conducted at both the group-level and within-group-level, that is, a bi-level selection. In this study, we propose the adaptive sparse group Lasso (adSGL) method, which combines the adaptive Lasso and adaptive group Lasso (GL) to achieve bi-level selection. It can be viewed as an improved version of sparse group Lasso (SGL) and uses data-dependent weights to improve selection performance. For computation, a block coordinate descent algorithm is adopted. Simulation shows that adSGL has satisfactory performance in identifying both individual variables and groups and lower false discovery rate and mean square error than SGL and GL. We apply the proposed method to the analysis of a household healthcare expenditure data set.

18.
We study partial linear single-index models (PLSiMs) when the response and the covariates in the parametric part are measured with additive distortion measurement errors. These distortions are modeled by unknown functions of a commonly observable confounding variable. We use the semiparametric profile least-squares method to estimate the parameters in the PLSiMs based on the residuals obtained from the distorted variables and confounding variable. We also employ the smoothly clipped absolute deviation penalty (SCAD) to select the relevant variables in the PLSiMs. We show that the resulting SCAD estimators are consistent and possess the oracle property. For the nonparametric link function, we construct simultaneous confidence bands and obtain the asymptotic distribution of the maximum absolute deviation between the estimated link function and the true link function. A simulation study is conducted to evaluate the performance of the proposed methods and a real dataset is analyzed for illustration.

19.
Abstract

Structured sparsity has recently become a very popular technique for dealing with high-dimensional data. In this paper, we mainly focus on the theoretical problems for the overlapping group structure of generalized linear models (GLMs). Although the overlapping group lasso method for GLMs has been widely applied, its theoretical properties are still unknown. Under some general conditions, we present oracle inequalities for the estimation and prediction error of the overlapping group Lasso method in the generalized linear model setting. Then, we apply these results to the logistic and Poisson regression models. It is shown that the results of the Lasso and group Lasso procedures for GLMs can be recovered by specifying the group structures in our proposed method. The effect of overlap and the performance of variable selection of our proposed method are both studied by numerical simulations. Finally, we apply our proposed method to two gene expression data sets: the p53 data and the lung cancer data.

20.
ABSTRACT

Supersaturated designs (SSDs) constitute a large class of fractional factorial designs which can be used for screening out the important factors from a large set of potentially active ones. A major advantage of these designs is that they reduce the experimental cost dramatically, but their crucial disadvantage is the confounding involved in the statistical analysis. Identification of active effects in SSDs has been the subject of much recent study. In this article we present a two-stage procedure for analyzing two-level SSDs assuming a main-effect only model, without including any interaction terms. This method combines sure independence screening (SIS) with different penalty functions, such as the smoothly clipped absolute deviation (SCAD), Lasso and MC penalties, achieving both the down-selection and the estimation of the significant effects simultaneously. Insights on using the proposed methodology are provided through various simulation scenarios, and several comparisons with existing approaches, such as stepwise selection in combination with SCAD and the Dantzig selector (DS), are presented as well. Results of the numerical study and real data analysis reveal that the proposed procedure can be considered an advantageous tool due to its extremely good performance in identifying active factors.
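
The screening stage of such a two-stage procedure can be sketched as follows. This is a minimal illustration of sure independence screening in its usual form (rank factors by absolute marginal correlation with the response and keep the top `d`), not the authors' implementation. The function names, the toy 6-run, 6-factor design and the response values are all hypothetical, and the second stage (a SCAD, Lasso or MC penalized fit on the retained factors) is omitted.

```python
import math


def abs_marginal_correlation(x, y):
    """Absolute Pearson correlation between one factor column and the response."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0                       # constant column or response: no signal
    return abs(sxy) / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Return the (sorted) indices of the d factors with largest |correlation|."""
    scores = [abs_marginal_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Toy two-level design (hypothetical): 6 runs, 6 factors coded -1/+1,
# stored column by column. The response is driven mainly by factor 0.
columns = [
    [1, 1, 1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, -1],
    [-1, 1, 1, -1, -1, 1],
    [1, -1, -1, 1, 1, -1],
    [1, -1, -1, 1, -1, 1],
]
y = [4.1, 3.9, 4.2, -4.0, -3.8, -4.1]

kept = sis_screen(columns, y, d=3)
print(kept)   # indices passed on to the penalized second stage
```

Screening first reduces the number of candidate factors below the number of runs, so that the penalized fit in the second stage works on a smaller, better-conditioned problem.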
