首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
A challenging problem in the analysis of high-dimensional data is variable selection. In this study, we describe a bootstrap based technique for selecting predictors in partial least-squares regression (PLSR) and principle component regression (PCR) in high-dimensional data. Using a bootstrap-based technique for significance tests of the regression coefficients, a subset of the original variables can be selected to be included in the regression, thus obtaining a more parsimonious model with smaller prediction errors. We compare the bootstrap approach with several variable selection approaches (jack-knife and sparse formulation-based methods) on PCR and PLSR in simulation and real data.  相似文献   

2.
We consider the problem of variable selection in high-dimensional partially linear models with longitudinal data. A variable selection procedure is proposed based on the smooth-threshold generalized estimating equation (SGEE). The proposed procedure automatically eliminates inactive predictors by setting the corresponding parameters to be zero, and simultaneously estimates the nonzero regression coefficients by solving the SGEE. We establish the asymptotic properties in a high-dimensional framework where the number of covariates pn increases as the number of clusters n increases. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedure.  相似文献   

3.
Efficient statistical inference on nonignorable missing data is a challenging problem. This paper proposes a new estimation procedure based on composite quantile regression (CQR) for linear regression models with nonignorable missing data, that is applicable even with high-dimensional covariates. A parametric model is assumed for modelling response probability, which is estimated by the empirical likelihood approach. Local identifiability of the proposed strategy is guaranteed on the basis of an instrumental variable approach. A set of data-based adaptive weights constructed via an empirical likelihood method is used to weight CQR functions. The proposed method is resistant to heavy-tailed errors or outliers in the response. An adaptive penalisation method for variable selection is proposed to achieve sparsity with high-dimensional covariates. Limiting distributions of the proposed estimators are derived. Simulation studies are conducted to investigate the finite sample performance of the proposed methodologies. An application to the ACTG 175 data is analysed.  相似文献   

4.
This paper studies a fast computational algorithm for variable selection on high-dimensional recurrent event data. Based on the lasso penalized partial likelihood function for the response process of recurrent event data, a coordinate descent algorithm is used to accelerate the estimation of regression coefficients. This algorithm is capable of selecting important predictors for underdetermined problems where the number of predictors far exceeds the number of cases. The selection strength is controlled by a tuning constant that is determined by a generalized cross-validation method. Our numerical experiments on simulated and real data demonstrate the good performance of penalized regression in model building for recurrent event data in high-dimensional settings.  相似文献   

5.
Variable and model selection problems are fundamental to high-dimensional statistical modeling in diverse fields of sciences. Especially in health studies, many potential factors are usually introduced to determine an outcome variable. This paper deals with the problem of high-dimensional statistical modeling through the analysis of the trauma annual data in Greece for 2005. The data set is divided into the experiment and control sets and consists of 6334 observations and 112 factors that include demographic, transport and intrahospital data used to detect possible risk factors of death. In our study, different model selection techniques are applied to the experiment set and the notion of deviance is used on the control set to assess the fit of the overall selected model. The statistical methods employed in this work were the non-concave penalized likelihood methods, smoothly clipped absolute deviation, least absolute shrinkage and selection operator, and Hard, the generalized linear logistic regression, and the best subset variable selection.The way of identifying the significant variables in large medical data sets along with the performance and the pros and cons of the various statistical techniques used are discussed. The performed analysis reveals the distinct advantages of the non-concave penalized likelihood methods over the traditional model selection techniques.  相似文献   

6.
This paper is about variable selection with the random forests algorithm in presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task, that becomes even more challenging in the presence of highly correlated predictors. Firstly we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates the variables using permutation importance measure as a ranking criterion. Next various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables together with a good prediction error. Finally, this selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.  相似文献   

7.
Sliced regression is an effective dimension reduction method by replacing the original high-dimensional predictors with its appropriate low-dimensional projection. It is free from any probabilistic assumption and can exhaustively estimate the central subspace. In this article, we propose to incorporate shrinkage estimation into sliced regression so that variable selection can be achieved simultaneously with dimension reduction. The new method can improve the estimation accuracy and achieve better interpretability for the reduced variables. The efficacy of proposed method is shown through both simulation and real data analysis.  相似文献   

8.
Variable selection problem is one of the most important tasks in regression analysis, especially in a high-dimensional setting. In this paper, we study this problem in the context of scalar response functional regression model, which is a linear model with scalar response and functional regressors. The functional model can be represented by certain multiple linear regression model via basis expansions of functional variables. Based on this model and random subspace method of Mielniczuk and Teisseyre (Comput Stat Data Anal 71:725–742, 2014), two simple variable selection procedures for scalar response functional regression model are proposed. The final functional model is selected by using generalized information criteria. Monte Carlo simulation studies conducted and a real data example show very satisfactory performance of new variable selection methods under finite samples. Moreover, they suggest that considered procedures outperform solutions found in the literature in terms of correctly selected model, false discovery rate control and prediction error.  相似文献   

9.
孙怡帆等 《统计研究》2021,38(5):136-146
随着信息技术的发展,高维数据日益丰富。现实中,很多高维数据由多个主体各异的数据集融合而成。如何准确识别出高维数据集间的异同性成为大数据分析的目标之一。本文提出了变系数模型下的高维数据整合分析方法。该方法可以同时对多个数据集进行变量选择和系数估计,并且能 够自动识别出变量系数在数据集间的异同性。模拟结果表明本文方法在异同性识别、变量选择、系数估 计和预测等方面明显优于对比方法。在肺癌致病基因识别的应用研究中,本文方法能够识别出具有生物解释的致病基因并发现了两种亚型之间的异同性。  相似文献   

10.
The paper considers a significance test of regression variables in the high-dimensional linear regression model when the dimension of the regression variables p, together with the sample size n, tends to infinity. Under two sightly different cases, we proved that the likelihood ratio test statistic will converge in distribution to a Gaussian random variable, and the explicit expressions of the asymptotical mean and covariance are also obtained. The simulations demonstrate that our high-dimensional likelihood ratio test method outperforms those using the traditional methods in analyzing high-dimensional data.  相似文献   

11.
Variable selection methods have been widely used in the analysis of high-dimensional data, for example, gene expression microarray data and single nucleotide polymorphism data. A special feature of the genomic data is that genes participating in a common metabolic pathway or sharing a similar biological function tend to have high correlations. The collinearity naturally embedded in these data requires special handling, which cannot be provided by existing variable selection methods. In this paper, we propose a set of new methods to select variables in correlated data. The new methods follow the forward selection procedure of least angle regression (LARS) but conduct grouping and selecting at the same time. The methods specially work when no prior information on group structures of data is available. Simulations and real examples show that our proposed methods often outperform the existing variable selection methods, including LARS and elastic net, in terms of both reducing prediction error and preserving sparsity of representation.  相似文献   

12.
The L1-type regularization provides a useful tool for variable selection in high-dimensional regression modeling. Various algorithms have been proposed to solve optimization problems for L1-type regularization. Especially the coordinate descent algorithm has been shown to be effective in sparse regression modeling. Although the algorithm shows a remarkable performance to solve optimization problems for L1-type regularization, it suffers from outliers, since the procedure is based on the inner product of predictor variables and partial residuals obtained from a non-robust manner. To overcome this drawback, we propose a robust coordinate descent algorithm, especially focusing on the high-dimensional regression modeling based on the principal components space. We show that the proposed robust algorithm converges to the minimum value of its objective function. Monte Carlo experiments and real data analysis are conducted to examine the efficiency of the proposed robust algorithm. We observe that our robust coordinate descent algorithm effectively performs for the high-dimensional regression modeling even in the presence of outliers.  相似文献   

13.
Partially linear varying coefficient models (PLVCMs) with heteroscedasticity are considered in this article. Based on composite quantile regression, we develop a weighted composite quantile regression (WCQR) to estimate the non parametric varying coefficient functions and the parametric regression coefficients. The WCQR is augmented using a data-driven weighting scheme. Moreover, the asymptotic normality of proposed estimators for both the parametric and non parametric parts are studied explicitly. In addition, by comparing the asymptotic relative efficiency theoretically and numerically, WCQR method all outperforms the CQR method and some other estimate methods. To achieve sparsity with high-dimensional covariates, we develop a variable selection procedure to select significant parametric components for the PLVCM and prove the method possessing the oracle property. Both simulations and data analysis are conducted to illustrate the finite-sample performance of the proposed methods.  相似文献   

14.
High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by applications in high-throughput genomic data analysis. In this paper, we propose a class of regularization methods, integrating both the penalized empirical likelihood and pseudoscore approaches, for variable selection and estimation in sparse and high-dimensional additive hazards regression models. When the number of covariates grows with the sample size, we establish asymptotic properties of the resulting estimator and the oracle property of the proposed method. It is shown that the proposed estimator is more efficient than that obtained from the non-concave penalized likelihood approach in the literature. Based on a penalized empirical likelihood ratio statistic, we further develop a nonparametric likelihood approach for testing the linear hypothesis of regression coefficients and constructing confidence regions consequently. Simulation studies are carried out to evaluate the performance of the proposed methodology and also two real data sets are analyzed.  相似文献   

15.
The problem of modeling the relationship between a set of covariates and a multivariate response with correlated components often arises in many areas of research such as genetics, psychometrics, signal processing. In the linear regression framework, such task can be addressed using a number of existing methods. In the high-dimensional sparse setting, most of these methods rely on the idea of penalization in order to efficiently estimate the regression matrix. Examples of such methods include the lasso, the group lasso, the adaptive group lasso or the simultaneous variable selection (SVS) method. Crucially, a suitably chosen penalty also allows for an efficient exploitation of the correlation structure within the multivariate response. In this paper we introduce a novel variant of such method called the adaptive SVS, which is closely linked with the adaptive group lasso. Via a simulation study we investigate its performance in the high-dimensional sparse regression setting. We provide a comparison with a number of other popular methods under different scenarios and show that the adaptive SVS is a powerful tool for efficient recovery of signal in such setting. The methods are applied to genetic data.  相似文献   

16.
In high-dimensional regression problems regularization methods have been a popular choice to address variable selection and multicollinearity. In this paper we study bridge regression that adaptively selects the penalty order from data and produces flexible solutions in various settings. We implement bridge regression based on the local linear and quadratic approximations to circumvent the nonconvex optimization problem. Our numerical study shows that the proposed bridge estimators are a robust choice in various circumstances compared to other penalized regression methods such as the ridge, lasso, and elastic net. In addition, we propose group bridge estimators that select grouped variables and study their asymptotic properties when the number of covariates increases along with the sample size. These estimators are also applied to varying-coefficient models. Numerical examples show superior performances of the proposed group bridge estimators in comparisons with other existing methods.  相似文献   

17.
To perform variable selection in expectile regression, we introduce the elastic-net penalty into expectile regression and propose an elastic-net penalized expectile regression (ER-EN) model. We then adopt the semismooth Newton coordinate descent (SNCD) algorithm to solve the proposed ER-EN model in high-dimensional settings. The advantages of ER-EN model are illustrated via extensive Monte Carlo simulations. The numerical results show that the ER-EN model outperforms the elastic-net penalized least squares regression (LSR-EN), the elastic-net penalized Huber regression (HR-EN), the elastic-net penalized quantile regression (QR-EN) and conventional expectile regression (ER) in terms of variable selection and predictive ability, especially for asymmetric distributions. We also apply the ER-EN model to two real-world applications: relative location of CT slices on the axial axis and metabolism of tacrolimus (Tac) drug. Empirical results also demonstrate the superiority of the ER-EN model.  相似文献   

18.
High-dimensional data arise frequently in modern applications such as biology, chemometrics, economics, neuroscience and other scientific fields. The common features of high-dimensional data are that many of predictors may not be significant, and there exists high correlation among predictors. Generalized linear models, as the generalization of linear models, also suffer from the collinearity problem. In this paper, combining the nonconvex penalty and ridge regression, we propose the weighted elastic-net to deal with the variable selection of generalized linear models on high dimension and give the theoretical properties of the proposed method with a diverging number of parameters. The finite sample behavior of the proposed method is illustrated with simulation studies and a real data example.  相似文献   

19.
Bayesian model averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ?1 regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ?1 regularization path as a model space, we propose a method to resolve the model uncertainty issues arising in model averaging from solution path point selection. We show that this method is computationally and empirically effective for regression and classification in high-dimensional data sets. We apply our technique in simulations, as well as to some applications that arise in genomics.  相似文献   

20.
In the problem of selecting variables in a multivariate linear regression model, we derive new Bayesian information criteria based on a prior mixing a smooth distribution and a delta distribution. Each of them can be interpreted as a fusion of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Inheriting their asymptotic properties, our information criteria are consistent in variable selection in both the large-sample and the high-dimensional asymptotic frameworks. In numerical simulations, variable selection methods based on our information criteria choose the true set of variables with high probability in most cases.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号