Similar literature
20 similar references retrieved (search time: 46 ms)
1.
In the problem of selecting variables in a multivariate linear regression model, we derive new Bayesian information criteria based on a prior mixing a smooth distribution and a delta distribution. Each of them can be interpreted as a fusion of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Inheriting their asymptotic properties, our information criteria are consistent in variable selection in both the large-sample and the high-dimensional asymptotic frameworks. In numerical simulations, variable selection methods based on our information criteria choose the true set of variables with high probability in most cases.

2.
This paper is concerned with the selection of explanatory variables in generalized linear models (GLMs). The class of GLMs is quite large and contains, for example, ordinary linear regression, binary logistic regression, the probit model, and Poisson regression with a linear or log-linear parameter structure. We show that, through an approximation of the log likelihood and a certain data transformation, the variable selection problem in a GLM can be converted into variable selection in an ordinary (unweighted) linear regression model. As a consequence, no specific computer software for variable selection in GLMs is needed; instead, any suitable variable selection program for linear regression can be used. We also present a simulation study which shows that the log likelihood approximation is very good in many practical situations. Finally, we briefly mention possible extensions to regression models outside the class of GLMs.
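A minimal sketch of the general idea described in this abstract: approximate the GLM log-likelihood by its iteratively reweighted least squares (IRLS) quadratic at the full-model fit, fold the square-root weights into the design, and hand the transformed data to any ordinary linear regression subset-selection routine. This is not the paper's exact transformation; the function name and the choice of the IRLS working response are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

def glm_to_linear(X, y, family=sm.families.Binomial()):
    """Approximate a GLM by an unweighted linear regression problem.

    Fit the full GLM once, form the IRLS working response z and weights w at
    the fitted values, and fold sqrt(w) into the design so that ordinary
    least squares on (X*, z*) mimics the GLM log-likelihood locally.
    """
    Xc = sm.add_constant(X)
    fit = sm.GLM(y, Xc, family=family).fit()
    mu = fit.fittedvalues
    eta = family.link(mu)                      # linear predictor at the fit
    dmu_deta = 1.0 / family.link.deriv(mu)     # d mu / d eta
    z = eta + (y - mu) / dmu_deta              # IRLS working response
    w = dmu_deta ** 2 / family.variance(mu)    # IRLS weights
    sw = np.sqrt(w)
    # note: the constant column is already included and rescaled, so later
    # OLS fits on the returned pair should not add another intercept
    return Xc * sw[:, None], z * sw
```

Any linear-model selection tool (e.g. a leaps-style best-subset search) can then be run on the returned design/response pair in place of GLM-specific software.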

3.
We consider the problem of variable selection for a class of varying coefficient models with instrumental variables. We focus on the case where some covariates are endogenous and some auxiliary instrumental variables are available. An instrumental-variable-based variable selection procedure is proposed using modified smooth-threshold estimating equations (SEEs). The proposed procedure automatically eliminates the irrelevant covariates by setting the corresponding coefficient functions to zero, and simultaneously estimates the nonzero regression coefficients by solving the smooth-threshold estimating equations. The procedure avoids solving a convex optimization problem, and is flexible and easy to implement. Simulation studies are carried out to assess the performance of the proposed variable selection method.

4.
This paper proposes a variable selection method for detecting abnormal items based on the T² test when observations on abnormal items are available. Based on unbiased estimates of the power for all subsets of variables, the method selects the subset of variables that maximizes the power estimate. Since more than one subset of variables frequently maximizes the power estimate, the averaged p-value of the rejected items is used as a second criterion. Although the performance of the method depends on the sample size for the abnormal items and the true power values for all subsets of variables, numerical experiments show the effectiveness of the proposed method. Since normal and abnormal items are simulated using one-factor and two-factor models, basic properties of the power functions for these models are also investigated.

5.
We present APproximated Exhaustive Search (APES), which enables fast, approximate exhaustive variable selection in generalised linear models (GLMs). While exhaustive variable selection remains the gold standard in many model selection contexts, it suffers from computational feasibility issues: there is often a high cost associated with computing maximum likelihood estimates (MLEs) for all subsets of a GLM. Efficient algorithms for exhaustive searches exist for linear models, most notably the leaps-and-bounds algorithm and, more recently, the mixed integer optimisation (MIO) algorithm. The APES method learns from observational weights in a generalised linear regression super-model and reformulates the GLM problem as a linear regression problem. In this way, APES can approximate a true exhaustive search in the original GLM space. Where exhaustive variable selection is not computationally feasible, we propose a best-subset search, which also closely approximates a true exhaustive search. APES is made available both as a standalone R package and as part of the existing mplot package.

6.
Nonparametric seemingly unrelated regression provides a powerful alternative to parametric seemingly unrelated regression by relaxing the linearity assumption. Existing methods are limited, particularly when there are sharp changes in the relationship between the predictor variables and the corresponding response variable. We propose a new nonparametric method for seemingly unrelated regression that adopts a tree-structured regression framework, achieves satisfactory prediction accuracy and interpretability, places no restriction on the inclusion of categorical variables, and is less vulnerable to the curse of dimensionality. Moreover, an important feature is the construction of a unified tree-structured model for multivariate data, even when the predictor variables corresponding to each response variable are entirely different. This unified model can offer revealing insights, such as the underlying economic meaning. We address the key components of tree-structured regression: an impurity function that detects complex nonlinear relationships between the predictor variables and the response variable, split-rule selection with negligible selection bias, and tree-size determination that addresses underfitting and overfitting. We demonstrate the proposed method using simulated data and illustrate it with data from the Korea stock exchange sector indices.

7.
A polynomial functional relationship with errors in both variables can be consistently estimated by constructing an ordinary least squares estimator for the regression coefficients under the hypothetical assumption that the latent true regressor is known, and then adjusting for the errors. If normality of the error variables can be assumed, the estimator simplifies considerably: only the variance of the errors in the regressor variable and its covariance with the errors of the response variable need to be known. If the variance of the errors in the dependent variable is also known, another estimator can be constructed.
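The adjustment is easiest to see in the degree-one (simple linear) special case. The sketch below assumes the error variance in the regressor and its covariance with the response error are known, as in the abstract; the function name is illustrative and this is not the paper's general polynomial estimator.

```python
import numpy as np

def corrected_slope(x, y, sigma_uu, sigma_uv=0.0):
    """Method-of-moments correction of the OLS slope for measurement error.

    Model: y = b0 + b1*xi + e, observed x = xi + u, with Var(u) = sigma_uu
    and Cov(u, e) = sigma_uv assumed known. The naive slope S_xy / S_xx is
    biased; subtracting the error moments restores consistency.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    sxx = np.var(x, ddof=1)                   # sample variance of observed x
    sxy = np.cov(x, y, ddof=1)[0, 1]          # sample covariance of x and y
    b1 = (sxy - sigma_uv) / (sxx - sigma_uu)  # error-adjusted slope
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```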

8.
Interaction is very common in practice but has received little attention in the logistic regression literature, especially for higher-order interactions; in conventional logistic regression, interactions are typically ignored. We propose a model selection procedure that incorporates an association rules analysis. We do this by (1) exploring the combinations of input variables that have a significant impact on the response (via association rules analysis); (2) selecting the potential (low- and high-order) interactions; (3) converting these potential interactions into new dummy variables; and (4) performing variable selection among all the input variables and the newly created dummy variables (interactions) to build the optimal logistic regression model. The procedure thus establishes the optimal combination of main effects and potential interactions. Comparisons are made through thorough simulations, which show that the proposed method outperforms the existing methods in all cases. A real-life example is discussed in detail to demonstrate the proposed method.
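A hedged sketch of steps (1)-(4) above, using a simple support/lift screen over binary inputs as a stand-in for a full association-rules engine; the thresholds and function names are illustrative assumptions.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def screen_interactions(Xbin, y, max_order=3, min_support=0.05, min_lift=1.2):
    """Screen binary inputs for candidate interactions (association-rule style).

    For each combination of up to `max_order` binary columns, form the AND of
    the columns and keep it if it is frequent enough (support) and more
    associated with y = 1 than chance (lift). The kept columns become new
    dummy variables for a subsequent penalized logistic regression.
    """
    n, p = Xbin.shape
    base_rate = y.mean()
    dummies, names = [], []
    for order in range(2, max_order + 1):
        for idx in combinations(range(p), order):
            joint = Xbin[:, idx].all(axis=1).astype(float)
            support = joint.mean()
            if support < min_support:
                continue
            lift = y[joint == 1].mean() / base_rate
            if lift >= min_lift:
                dummies.append(joint)
                names.append(idx)
    D = np.column_stack(dummies) if dummies else np.empty((n, 0))
    return D, names

# usage sketch: append screened interactions and let an L1 penalty pick the model
# X_full = np.hstack([Xbin, screen_interactions(Xbin, y)[0]])
# model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X_full, y)
```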

9.
Many tree algorithms have been developed for regression problems. Although they are regarded as good algorithms, most of them suffer from loss of prediction accuracy when there are many irrelevant variables and the number of predictors exceeds the number of observations. We propose the multistep regression tree with adaptive variable selection to handle this problem. The variable selection step and the fitting step comprise the multistep method.

The multistep generalized unbiased interaction detection and estimation (GUIDE) with adaptive forward selection (fg) algorithm, used as a variable selection tool, performs better for the regression problem than several well-known variable selection algorithms, such as efficacy adaptive regression tube hunting (EARTH), FSR (false selection rate), LSCV (least squares cross-validation), and LASSO (least absolute shrinkage and selection operator). Results based on a simulation study show that fg outperforms the other algorithms in terms of both selection quality and computation time: it generally selects the important variables correctly with relatively few irrelevant variables, which gives good prediction accuracy at a lower computational cost.

10.
This paper considers variable and factor selection in factor analysis. We treat the factor loadings for each observable variable as a group, and introduce a weighted sparse group lasso penalty to the complete log-likelihood. The proposal simultaneously selects observable variables and latent factors of a factor analysis model in a data-driven fashion; it produces a more flexible and sparse factor loading structure than existing methods. For parameter estimation, we derive an expectation-maximization algorithm that optimizes the penalized log-likelihood. The tuning parameters of the procedure are selected by a likelihood cross-validation criterion that yields satisfactory results in various simulation settings. Simulation results reveal that the proposed method can better identify the possibly sparse structure of the true factor loading matrix with higher estimation accuracy than existing methods. A real data example is also presented to demonstrate its performance in practice.

11.
Based on the mutual information between the explanatory variables and the response variable, this paper proposes a new variable selection method, MI-SIS. The method can handle ultrahigh-dimensional problems in which the number of explanatory variables p is much larger than the sample size n, i.e. p = O(exp(n^ε)) for some ε > 0. Moreover, it is a model-free variable selection method that does not rely on model assumptions. Numerical simulations and an empirical study show that MI-SIS can effectively detect weak signals in small-sample settings.
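A minimal sketch of marginal screening by mutual information in the spirit of MI-SIS, using scikit-learn's mutual information estimator as a stand-in for the estimator used in the paper; the default screening size n / log(n) follows the usual sure-independence-screening convention and is an assumption here.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_sis(X, y, d=None):
    """Rank predictors by estimated mutual information with the response
    and keep the top d (default: n / log(n), a common screening size)."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    mi = mutual_info_regression(X, y)    # marginal MI estimate per column
    keep = np.argsort(mi)[::-1][:d]      # indices of the d strongest signals
    return keep, mi
```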

12.
In prediction problems, both the response and the covariates may be highly correlated with a second group of influential regressors that can be considered background variables. An important challenge is to perform variable selection and importance assessment among the covariates in the presence of these background variables. A clinical example is the prediction of lean body mass (response) from bioimpedance (covariates), where anthropometric measures play the role of background variables. We introduce a reduced dataset in which the variables are defined as the residuals with respect to the background, and perform variable selection and importance assessment in both linear and random forest models. Using a clinical dataset of multi-frequency bioimpedance, we show the effectiveness of this method for selecting the most relevant predictors of lean body mass beyond anthropometry.
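A small sketch of the reduced-dataset construction: regress both the covariates and the response on the background variables, keep the residuals, and run selection or importance assessment on the residualized problem. The variable names (X, Z, y) and the random forest settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def residualize(A, B):
    """Return the part of each column of A not explained linearly by B."""
    fit = LinearRegression().fit(B, A)
    return A - fit.predict(B)

# Build the reduced dataset and assess importance on the residualized problem.
# X: covariates (e.g. bioimpedance), Z: background (e.g. anthropometry), y: response
# X_res = residualize(X, Z)
# y_res = residualize(y.reshape(-1, 1), Z).ravel()
# importance = RandomForestRegressor(n_estimators=500).fit(X_res, y_res).feature_importances_
```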

13.
闫懋博 (Yan Maobo), 田茂再 (Tian Maozai). 统计研究 (Statistical Research), 2021, 38(1): 147-160.
The number of variables that penalized variable selection methods such as the Lasso can bring into the model is limited by the sample size. Existing methods for testing the significance of variable coefficients discard the information contained in the variables not selected into the model. In the high-dimensional setting where the number of variables exceeds the sample size (p > n), this paper uses a randomized bootstrap to obtain variable weights, constructs the conditional distribution of the selection event when computing the adaptive Lasso, and removes the variables whose coefficients are not significant to obtain the final estimate. The contribution of this paper is that the proposed method breaks the limit on the number of variables the adaptive Lasso can select, and can effectively distinguish the true variables from noise variables when the observed data contain a large number of noise variables. Compared with existing penalized variable selection methods, simulation studies under multiple scenarios demonstrate the advantages of the proposed method on these two problems. In an empirical study, the NCI-60 cancer cell line data are analysed, and the results show a clear improvement over the previous literature.

14.
Beta regression models are commonly used by practitioners to model variables that take values in the standard unit interval (0, 1). In this paper, we consider variable selection for beta regression models with varying dispersion (VBRM), in which both the mean and the dispersion depend on predictor variables. Based on a penalized likelihood method, the consistency and the oracle property of the penalized estimators are established. Following the coordinate descent algorithm idea for generalized linear models, we develop a new variable selection procedure for the VBRM that can efficiently and simultaneously estimate and select important variables in both the mean model and the dispersion model. Simulation studies and a body fat data analysis are presented to illustrate the proposed methods.

15.
Numerous variable selection methods rely on a two-stage procedure in which a sparsity-inducing penalty is used in the first stage to predict the support, which is then conveyed to the second stage for estimation or inference purposes. In this framework, the first stage screens variables to find a set of possibly relevant variables, and the second stage operates on this set of candidate variables to improve estimation accuracy or to assess the uncertainty associated with the selection of variables. We advocate that more information can be conveyed from the first stage to the second: we use the magnitude of the coefficients estimated in the first stage to define an adaptive penalty that is applied at the second stage. We give the example of an inference procedure that benefits greatly from the proposed transfer of information. The procedure is analyzed precisely in a simple setting, and our large-scale experiments empirically demonstrate that actual benefits can be expected in much more general situations, with sensitivity gains ranging from 50 to 100% compared to the state of the art.
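A minimal sketch of the transfer idea under simple assumptions: stage-one Lasso magnitudes define adaptive weights, implemented by column rescaling, so that stage two penalizes strong stage-one variables less. This is a generic adaptive-penalty sketch, not the paper's exact inference procedure; the function name and tuning values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def two_stage_adaptive(X, y, eps=1e-6, alpha2=0.05):
    """Two-stage selection that carries coefficient magnitudes, not just support.

    Stage 1: cross-validated Lasso gives preliminary coefficients b1.
    Stage 2: penalty weights proportional to 1/|b1| (implemented by column
    rescaling), so variables with large stage-1 coefficients are penalized less.
    """
    b1 = LassoCV(cv=5).fit(X, y).coef_
    w = 1.0 / (np.abs(b1) + eps)                           # adaptive weights
    fit2 = Lasso(alpha=alpha2, max_iter=10000).fit(X / w, y)
    beta2 = fit2.coef_ / w                                  # back to original scale
    return beta2
```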

16.
Statistical methods for variable selection and prediction can be challenging when covariates are missing. Although multiple imputation (MI) is a universally accepted technique for handling missing data, how to combine MI results for variable selection is not entirely clear, because different imputations may result in different selections. Widely applied variable selection methods include the sparse partial least-squares (SPLS) method and penalized least-squares methods such as the elastic net (ENet). In this paper, we propose an MI-based weighted elastic net (MI-WENet) method that is based on stacked MI data and a weighting scheme for each observation in the stacked data set. In the MI-WENet method, MI accounts for sampling and imputation uncertainty for missing values, and the weight accounts for the observed information. Extensive numerical simulations are carried out to compare the proposed MI-WENet method with competing alternatives such as SPLS and ENet. In addition, we apply the MI-WENet method to examine the predictor variables for endothelial function, characterized by the median effective dose (ED50) and maximum effect (Emax), in an ex-vivo phenylephrine-induced extension and acetylcholine-induced relaxation experiment.
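A rough sketch of an MI-WENet-style fit: impute m times, stack the imputed copies, and fit a single elastic net with per-row weights reflecting observed information divided by m. The choice of imputer, the weighting rule, and the use of sample_weight in ElasticNet.fit (available in recent scikit-learn releases) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import ElasticNet

def mi_weighted_enet(X, y, m=5, alpha=0.1, l1_ratio=0.5, seed=0):
    """Stack m imputed copies of X and fit one elastic net with observation weights.

    Each observation appears m times in the stack; its weight is
    (fraction of observed entries) / m, so rows with more missing data
    (hence more imputation uncertainty) count less.
    """
    obs_frac = 1.0 - np.isnan(X).mean(axis=1)          # observed-information weight
    stacks, weights, ys = [], [], []
    for i in range(m):
        imp = IterativeImputer(random_state=seed + i, sample_posterior=True)
        stacks.append(imp.fit_transform(X))
        weights.append(obs_frac / m)
        ys.append(y)
    Xs, ws, ystack = np.vstack(stacks), np.concatenate(weights), np.concatenate(ys)
    return ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(
        Xs, ystack, sample_weight=ws)
```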

17.
This paper considers a linear regression model with regression parameter vector β. The parameter of interest is θ = aᵀβ, where a is specified. When, as a first step, a data-based variable selection procedure (e.g. minimum Akaike information criterion) is used to select a model, it is common statistical practice to then carry out inference about θ, using the same data, based on the (false) assumption that the selected model had been specified a priori. The paper considers a confidence interval for θ with nominal coverage 1 − α constructed on this (false) assumption, and calls this the naive 1 − α confidence interval. The minimum coverage probability of this confidence interval can be calculated for simple variable selection procedures involving only a single variable. However, the kinds of variable selection procedures used in practice are typically much more complicated. For the real-life data presented in this paper, there are 20 variables, each of which is to be either included or not, leading to 2²⁰ different models. The coverage probability at any given value of the parameters provides an upper bound on the minimum coverage probability of the naive confidence interval. This paper derives a new Monte Carlo simulation estimator of the coverage probability, which uses conditioning for variance reduction. For these real-life data, the gain in efficiency of this Monte Carlo simulation due to conditioning ranged from 2 to 6. The paper also presents a simple one-dimensional search strategy for parameter values at which the coverage probability is relatively small. For these real-life data, this search leads to parameter values for which the coverage probability of the naive 0.95 confidence interval is 0.79 for variable selection using the Akaike information criterion and 0.70 for variable selection using the Bayesian information criterion, showing that these confidence intervals are completely inadequate.
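A brute-force Monte Carlo sketch (without the paper's conditioning variance reduction) of estimating the coverage probability of the naive confidence interval after minimum-AIC selection; the exhaustive subset search makes it practical only for a handful of candidate variables, and the function name and defaults are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import t as tdist

def naive_coverage(beta, a, n=100, sigma=1.0, reps=2000, alpha=0.05, seed=0):
    """Monte Carlo estimate of the coverage of the naive 1 - alpha CI for theta = a'beta.

    Each replication simulates data, picks the minimum-AIC subset of columns
    (exhaustive search over all non-empty subsets), then builds the usual
    confidence interval in the selected model as if it were fixed a priori.
    """
    rng = np.random.default_rng(seed)
    beta, a = np.asarray(beta, float), np.asarray(a, float)
    p, theta, cover = len(beta), a @ beta, 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        best_fit, best_cols, best_aic = None, None, np.inf
        for mask in range(1, 2 ** p):
            cols = [j for j in range(p) if (mask >> j) & 1]
            fit = sm.OLS(y, X[:, cols]).fit()
            if fit.aic < best_aic:
                best_fit, best_cols, best_aic = fit, cols, fit.aic
        a_sel = a[best_cols]
        est = a_sel @ best_fit.params
        se = np.sqrt(a_sel @ best_fit.cov_params() @ a_sel)
        crit = tdist.ppf(1 - alpha / 2, df=n - len(best_cols))
        cover += abs(est - theta) <= crit * se
    return cover / reps
```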

18.
We consider logistic regression with covariate measurement error. Most existing approaches require certain replicates of the error-contaminated covariates, which may not be available in the data. We propose generalized method of moments (GMM) nonparametric correction approaches that use instrumental variables observed in a calibration subsample. The instrumental variable is related to the underlying true covariates through a general nonparametric model, and the probability of being in the calibration subsample may depend on the observed variables. We first take a simple approach adopting the inverse selection probability weighting technique using the calibration subsample. We then improve the approach based on the GMM using the whole sample. The asymptotic properties are derived, and the finite sample performance is evaluated through simulation studies and an application to a real data set.

19.
This note discusses a problem that can occur when forward stepwise regression is used for variable selection and one of the candidate variables is a categorical variable with more than two categories. Most software packages (such as SAS, SPSSx, BMDP) include special programs for performing stepwise regression, and the user of these programs has to code categorical variables with dummy variables. In this case the forward selection might wrongly indicate that a categorical variable with more than two categories is nonsignificant. This is a disadvantage of forward selection compared with the backward elimination method. A way to avoid the problem is to test, in a single step, all dummy variables corresponding to the same categorical variable rather than one dummy variable at a time, as in the analysis of covariance. This option, however, is not available in forward stepwise procedures, except for stepwise logistic regression in BMDP. A practical alternative is to repeat the forward stepwise regression, changing the reference category each time.
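A small illustration of testing all dummy variables of a categorical predictor in a single step, here via a partial F-test comparing nested OLS fits in statsmodels rather than any stepwise package; the data frame df and the variable names are hypothetical.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# df is a pandas DataFrame with a numeric response y, numeric covariates
# x1 and x2, and a categorical variable "group" with more than two levels.
# Single-dummy forward steps can miss "group"; a joint F-test over all of
# its dummies evaluates the categorical variable as one block.
reduced = smf.ols("y ~ x1 + x2", data=df).fit()
full = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(anova_lm(reduced, full))   # F-test for adding all group dummies at once
```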

20.
Linear models with a growing number of parameters have been widely used in modern statistics, and an important problem for this kind of model is variable selection. Bayesian approaches, which provide a stochastic search over informative variables, have gained popularity. In this paper, we study the asymptotic properties of Bayesian model selection when the model dimension p grows with the sample size n. We consider a growing dimension p_n and provide sufficient conditions under which: (1) with large probability, the posterior probability of the true model (from which the samples are drawn) uniformly dominates the posterior probability of any incorrect model; and (2) the posterior probability of the true model converges to one in probability. Both (1) and (2) guarantee that the true model will be selected under a Bayesian framework. We also demonstrate several situations in which (1) holds but (2) fails, which illustrates the difference between these two properties. Finally, we generalize our results to g-priors and provide simulation examples to illustrate the main results.
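As an illustration of posterior model probabilities in this setting, the sketch below enumerates all subsets and scores each with the standard closed-form Bayes factor against the null (intercept-only) model under a Zellner g-prior and a uniform prior over models; the unit-information default g = n, the exhaustive enumeration, and the function name are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def g_prior_model_probs(X, y, g=None):
    """Posterior model probabilities under a Zellner g-prior (uniform model prior).

    Uses the usual closed-form Bayes factor of each subset model against the
    intercept-only model, BF = (1+g)^((n-p_M-1)/2) / (1 + g(1-R^2_M))^((n-1)/2).
    Enumerates all 2^p subsets, so keep p modest.
    """
    n, p = X.shape
    g = n if g is None else g
    yc = y - y.mean()
    tss = yc @ yc
    models, logbf = [], []
    for k in range(p + 1):
        for cols in combinations(range(p), k):
            if k == 0:
                models.append(cols)
                logbf.append(0.0)             # null model: BF = 1 by definition
                continue
            Xm = X[:, cols] - X[:, cols].mean(axis=0)
            beta, *_ = np.linalg.lstsq(Xm, yc, rcond=None)
            resid = yc - Xm @ beta
            r2 = 1.0 - (resid @ resid) / tss
            lb = 0.5 * (n - k - 1) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))
            models.append(cols)
            logbf.append(lb)
    logbf = np.array(logbf)
    probs = np.exp(logbf - logbf.max())
    probs /= probs.sum()                       # normalize over all models
    return models, probs
```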

