Model selection strategies play an important, if not explicit, role in quantitative research. The inferential properties of these strategies are largely unknown; therefore, there is little basis for recommending (or avoiding) any particular set of strategies. In this paper, we evaluate several commonly used model selection procedures [Bayesian information criterion (BIC), adjusted R², Mallows’ C_p, Akaike information criterion (AIC), AICc, and stepwise regression] using Monte Carlo simulation of model selection when the true data generating processes (DGP) are known.

We find that the ability of these selection procedures to include important variables and exclude irrelevant variables increases with the size of the sample and decreases with the amount of noise in the model. None of the model selection procedures do well in small samples, even when the true DGP is largely deterministic; thus, data mining in small samples should be avoided entirely. Instead, the implicit uncertainty in model specification should be explicitly discussed. In large samples, BIC is better than the other procedures at correctly identifying most of the generating processes we simulated, and stepwise does almost as well. In the absence of strong theory, both BIC and stepwise appear to be reasonable model selection strategies in large samples. Under the conditions simulated, adjusted R², Mallows’ C_p, AIC, and AICc are clearly inferior and should be avoided.
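For readers who want to see how the criteria compared in this abstract relate to one another, here is a minimal sketch for a Gaussian linear model. It is not taken from the paper: the function name `gaussian_criteria` and its arguments are hypothetical, the formulas are the common textbook forms up to additive constants, and the parameter count `k` (intercept plus slopes, error variance not counted) is one convention among several.

```python
import math

def gaussian_criteria(n, p, rss, tss, sigma2_full):
    """Selection criteria for an OLS fit with p predictors plus an intercept.

    n           -- number of observations
    rss         -- residual sum of squares of the candidate model
    tss         -- total sum of squares of the response about its mean
    sigma2_full -- error-variance estimate from the largest model (for C_p)
    Assumption: k = p + 1 parameters are counted; other texts also count
    the error variance, which shifts AIC/BIC by a constant per model size.
    """
    k = p + 1
    log_lik_term = n * math.log(rss / n)      # -2 log-likelihood up to a constant
    aic = log_lik_term + 2 * k
    bic = log_lik_term + k * math.log(n)
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    cp = rss / sigma2_full - n + 2 * k
    return {"aic": aic, "bic": bic, "aicc": aicc, "adj_r2": adj_r2, "cp": cp}

crit = gaussian_criteria(n=100, p=2, rss=50.0, tss=200.0, sigma2_full=0.5)
```

For AIC, BIC, AICc and C_p smaller values are preferred; for adjusted R² larger values are preferred. Because the BIC penalty per parameter is ln(n) rather than 2, BIC penalizes extra variables more heavily than AIC once n exceeds about 7, which is consistent with it excluding irrelevant variables more often in large samples.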

An adaptive variable selection procedure is proposed which uses an adaptive test along with a stepwise procedure to select variables for a multiple regression model. We compared this adaptive stepwise procedure to methods that use Akaike's information criterion, Schwarz's information criterion, and Sawa's information criterion. The simulation studies demonstrated that the adaptive stepwise method is more effective than the traditional variable selection methods if the errors are not normally distributed. If the errors are known to be normally distributed, the variable selection method based on Sawa's information criterion appears to be superior to the other methods. Unless the errors are known to be normally distributed, the adaptive stepwise method is recommended.

For stepwise regression and discriminant analysis the parameters F_in and F_out govern the inclusion and deletion of variables. The candidate variable with the biggest F-ratio is included if this exceeds F_in; the included variable with the smallest F-ratio is deleted if this is less than F_out. If F_in ≥ F_out, then return to a previous subset size implies improvement in the criterion measure. This result also holds for a generalization, stepwise multivariate analysis, which includes stepwise regression and discriminant analysis as special cases.

Eliminations do not occur if forward regression and backward elimination yield the same sequence of subsets. Conversely, there is a more liberal stepping rule which always eliminates if the two sequences differ.
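The inclusion and deletion rule described in this abstract can be sketched as follows. This is an illustrative reading of the standard stepwise rule, not the authors' code: the names `ols_rss` and `stepwise`, the default thresholds, and the iteration cap are all assumptions, and the least-squares fit uses plain normal equations, so it assumes a full-rank design and a fit that is not exact (nonzero residual sum of squares).

```python
def ols_rss(cols, y):
    """Residual sum of squares for OLS of y on an intercept plus cols."""
    n = len(y)
    z = [[1.0] * n] + [list(map(float, c)) for c in cols]
    p = len(z)
    # Augmented normal equations [Z'Z | Z'y]
    m = [[sum(a * b for a, b in zip(zi, zj)) for zj in z]
         + [sum(a * b for a, b in zip(zi, y))] for zi in z]
    for i in range(p):  # Gaussian elimination with partial pivoting
        piv = max(range(i, p), key=lambda r: abs(m[r][i]))
        m[i], m[piv] = m[piv], m[i]
        for r in range(i + 1, p):
            f = m[r][i] / m[i][i]
            for c in range(i, p + 1):
                m[r][c] -= f * m[i][c]
    beta = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        s = sum(m[i][c] * beta[c] for c in range(i + 1, p))
        beta[i] = (m[i][p] - s) / m[i][i]
    return sum((yt - sum(b * zj[t] for b, zj in zip(beta, z))) ** 2
               for t, yt in enumerate(y))

def stepwise(cols, y, f_in=4.0, f_out=3.9, max_iter=100):
    """Stepwise selection: add the best candidate if F > f_in, then drop
    the weakest included variable if its partial F < f_out."""
    n = len(y)
    chosen = []
    for _ in range(max_iter):  # cap guards against cycling if f_out > f_in
        changed = False
        base = ols_rss([cols[j] for j in chosen], y)
        df = n - len(chosen) - 2  # residual df after adding one variable
        scores = {}
        for j in range(len(cols)):
            if j not in chosen and df > 0:
                new = ols_rss([cols[k] for k in chosen + [j]], y)
                scores[j] = (base - new) / (new / df)
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] > f_in:
                chosen.append(best)
                changed = True
        if chosen:
            full = ols_rss([cols[j] for j in chosen], y)
            df = n - len(chosen) - 1
            partial = {j: (ols_rss([cols[k] for k in chosen if k != j], y)
                           - full) / (full / df) for j in chosen}
            worst = min(partial, key=partial.get)
            if partial[worst] < f_out:
                chosen.remove(worst)
                changed = True
        if not changed:
            break
    return chosen

# Deterministic toy data: y depends strongly on x0; x1 is an unrelated pattern.
x0 = list(range(20))
x1 = [(7 * t) % 5 for t in range(20)]
y = [3 * a + ((37 * t) % 11 - 5) / 10 for t, a in enumerate(x0)]
selected = stepwise([x0, x1], y)
```

With `f_in` at least as large as `f_out`, a variable that has just entered cannot be removed in the same pass, which is the setting the abstract's result addresses; the iteration cap only matters if the thresholds are chosen the other way round.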

In some industrial applications, the quality of a process or product is characterized by a relationship between the response variable and one or more independent variables, which is called a profile. There are many approaches for monitoring different types of profiles in the literature. Most researchers assume that the response variable follows a normal distribution. However, this assumption may be violated in many cases. The most likely situation is when the response variable follows a distribution from generalized linear models (GLMs). For example, when the response variable is the number of defects in a certain area of a product, the observations follow a Poisson distribution, and ignoring this fact will cause misleading results. In this paper, three methods, namely a T²-based method, a likelihood ratio test (LRT) method and an F method, are developed and modified in order to be applied in monitoring GLM regression profiles in Phase I. The performance of the proposed methods is analysed and compared for the special case in which the response variable follows a Poisson distribution. A simulation study is done regarding the probability of the signal criterion. Results show that the LRT method performs better than the other two methods and the F method performs better than the T²-based method in detecting either small or large step shifts as well as drifts. Moreover, the F method performs better than the other two methods, and the LRT method performs poorly in comparison with the F and T²-based methods, in detecting outliers. A real case, in which the size and number of agglomerates ejected from a volcano in successive days form the GLM profile, is illustrated, and the proposed methods are applied to determine whether the number of agglomerates of each size is under statistical control or not. Results show that the proposed methods can handle this situation and distinguish the out-of-control conditions.

In this paper, we propose a novel Max-Relevance and Min-Common-Redundancy criterion for variable selection in linear models. Considering that the ensemble approach for variable selection has been proven to be quite effective in linear regression models, we construct a variable selection ensemble (VSE) by combining the presented stochastic correlation coefficient algorithm with a stochastic stepwise algorithm. We conduct extensive experimental comparison of our algorithm and other methods using two simulation studies and four real-life data sets. The results confirm that the proposed VSE leads to promising improvement on variable selection and regression accuracy.

Abstract. It is quite common in epidemiology that we wish to assess the quality of estimators on a particular set of information, whereas the estimators may use a larger set of information. Two examples are studied: the first occurs when we construct a model for an event which happens if a continuous variable is above a certain threshold. We can compare estimators based on the observation of only the event or on the whole continuous variable. The other example is that of predicting the survival based only on survival information or using in addition information on a disease. We develop modified Akaike information criterion (AIC) and likelihood cross-validation (LCV) criteria to compare estimators in this non-standard situation. We show that a normalized difference of AIC has a bias equal to o(n⁻¹) if the estimators are based on well-specified models; a normalized difference of LCV always has a bias equal to o(n⁻¹). A simulation study shows that both criteria work well, although the normalized difference of LCV tends to be better and is more robust. Moreover, in the case of well-specified models the difference of risks boils down to the difference of statistical risks, which can be rather precisely estimated. For ‘compatible’ models the difference of risks is often the main term, but there can also be a difference of mis-specification risks.

The partial linear single-index model (PLSIM) combines the flexibility of nonparametric treatment with the interpretability of a linear term, yet the existing literature on it focuses mainly on mean regression, and quantile regression analysis is scarce. Based on free-knot spline approximation, we apply the asymmetric Laplace distribution to implement Bayesian quantile regression, and perform variable selection in the linear term and index vector via binary indicators. Our approach does not require the regularity conditions of frequentist methods, and can carry out variable selection and quantile regression under mutual posterior correction; it is also the first work to implement them jointly for PLSIM in a fully Bayesian framework. The numerical simulation demonstrates the superiority of our approach over previous methods, reflected in better efficiency of variable selection, index vector estimation and link function approximation under different error distributions. To illustrate its application, we build a power consumption model of the A2/O process in wastewater treatment and analyze in detail the impact of water quality factors.

In data sets with many predictors, algorithms for identifying a good subset of predictors are often used. Most such algorithms do not allow for any relationships between predictors. For example, stepwise regression might select a model containing an interaction AB but neither main effect A nor B. This paper develops mathematical representations of this and other relations between predictors, which may then be incorporated in a model selection procedure. A Bayesian approach that goes beyond the standard independence prior for variable selection is adopted, and preference for certain models is interpreted as prior information. Priors relevant to arbitrary interactions and polynomials, dummy variables for categorical factors, competing predictors, and restrictions on the size of the models are developed. Since the relations developed are for priors, they may be incorporated in any Bayesian variable selection algorithm for any type of linear model. The application of the methods here is illustrated via the stochastic search variable selection algorithm of George and McCulloch (1993), which is modified to utilize the new priors. The performance of the approach is illustrated with two constructed examples and a computer performance dataset.

In high-dimensional settings, componentwise L2Boosting has been used to construct sparse models that perform well, but it tends to select many ineffective variables. Several sparse boosting methods, such as SparseL2Boosting and Twin Boosting, have been proposed to improve the variable selection of the L2Boosting algorithm. In this article, we propose a new general sparse boosting method (GSBoosting). Relations are established between GSBoosting and other well-known regularized variable selection methods in the orthogonal linear model, such as the adaptive Lasso and hard thresholding. Simulation results show that GSBoosting has good performance in both prediction and variable selection.

Although the t-type estimator is a kind of M-estimator with scale optimization, it has some advantages over the M-estimator. In this article, we first propose a t-type joint generalized linear model as a robust extension of the classical joint generalized linear models for modeling data containing extreme or outlying observations. Next, we develop a t-type pseudo-likelihood (TPL) approach, which can be viewed as a robust version of the existing pseudo-likelihood (PL) approach. To determine which variables significantly affect the variance of the response variable, we then propose a unified penalized maximum TPL method to simultaneously select significant variables for the mean and dispersion models in t-type joint generalized linear models. Thus, the proposed variable selection method can simultaneously perform parameter estimation and variable selection in the mean and dispersion models. With appropriate selection of the tuning parameters, we establish the consistency and the oracle property of the regularized estimators. Simulation studies are conducted to illustrate the proposed methods.

Correlation is not causation. Spurious association between X and Y may be due to a confounding variable W. Statisticians may adjust for W using a variety of techniques. This article presents the results of simulations conducted to assess the performance of these techniques under various, elementary, data-generating processes. The results indicate that no technique is best overall and that specific techniques should be selected based on the particulars of the data-generating process. Here, we show how causal graphs can guide the selection or design of techniques for statistical adjustment. R programs are provided for researchers interested in generalization.


In this article, we propose a more general criterion, called the S_p-criterion, for subset selection in the multiple linear regression model. Many subset selection methods are based on the Least Squares (LS) estimator of β, but whenever the data contain an influential observation or the distribution of the error variable deviates from normality, the LS estimator performs ‘poorly’ and hence a method based on this estimator (for example, Mallows’ C_p-criterion) tends to select a ‘wrong’ subset. The proposed method overcomes this drawback and its main feature is that it can be used with any type of estimator (either the LS estimator or any robust estimator) of β without any need for modification of the proposed criterion. Moreover, this technique is operationally simple to implement as compared to other existing criteria. The method is illustrated with examples.

Using a forward selection procedure for selecting the best subset of regression variables involves the calculation of critical values (cutoffs) for an F-ratio at each step of a multistep search process. On dropping the restrictive (unrealistic) assumptions used in previous works, the null distribution of the F-ratio depends on unknown regression parameters for the variables already included in the subset. For the case of known σ, by conditioning the F-ratio on the set of regressors included so far and also on the observed (estimated) values of their regression coefficients, we obtain a forward selection procedure whose stepwise type I error does not depend on the unknown (nuisance) parameters. A numerical example with an orthogonal design matrix illustrates the difference between conditional cutoffs, cutoffs for the central F-distribution, and cutoffs suggested by Pope and Webster.


In some situations, for example, in biology or psychology studies, we wish to determine whether the linear relationship between response variable and predictor variables differs in two populations. Analysis of covariance (ANCOVA) or, equivalently, the partial F-test is the commonly used method. In this study, the asymptotic distribution for the difference between two independent regression coefficients was established. The proposed method was used to derive the asymptotic confidence set for the difference between coefficients and hypothesis testing for the equality of the two regression models. Then a simulation study was conducted to compare the proposed method with the partial F method. The performance of the new method was comparable with that of the partial F method.

Stepwise methods for variable selection are frequently used to determine the predictors of an outcome in generalized linear models. Although they are widely used within the scientific community, it is well known that the tests on the explained deviance of the selected model are biased. This arises from the fact that the traditional test statistics upon which these methods are based were intended for testing pre-specified hypotheses; instead, the tested model is selected through a data-driven procedure. A multiplicity problem therefore arises. In this work, we define and discuss a nonparametric procedure to adjust the p-value of the selected model of any stepwise selection method. The unbiasedness and consistency of the method are also proved. A simulation study shows the validity of this procedure. Theoretical differences with previous works in the same field are also discussed.

We consider the problem of deciding which of a set of p independent variables x_1, x_2, ..., x_p we are to regard as being functionally involved in the mean of a dependent normal random variable Y, and of estimating E(Y) in terms of the chosen x's. This mean is an unknown function (assumed to be doubly differentiable) of some or all of the x's, so that the problem is of wide relevance. We approximate the hypersurface in two different ways, and select within each approximation:

(a) For the situation where the mean of Y is assumed to be a linear function of the x's, we use one of the optimum methods of selection.

(b) More generally, in the space of the x's the function will be approximately linear in a relatively small region. Accordingly, this p-dimensional space is subdivided into smaller regions by a clustering procedure, and a hyperplane is fitted within each region to approximate the unknown response surface. An adaptation of an optimum-regressor-selection procedure is then used to assist in the selection of the regressors.

Approximate F tests are given to choose between models, including deciding how many x's to retain. Alternatively, the application of Akaike's Extended Maximum Likelihood Principle provides another way of choosing between the models and of selecting regressor variables. The methods are applied to data on glass manufacture.


Inflated data are prevalent in many situations, and a variety of inflated models with extensions have been derived to fit data with excessive counts of some particular responses. The family of information criteria (IC) has been used to compare the fit of models for selection purposes. Yet despite their common use in statistical applications, few studies have evaluated the performance of IC in inflated models. In this study, we examined the performance of IC for dual-inflated data. The new zero- and K-inflated Poisson (ZKIP) regression model and conventional models including Poisson regression and zero-inflated Poisson (ZIP) regression were fitted to dual-inflated data and the performance of the IC was compared. The effects of sample size and the proportion of inflated observations on selection performance were also examined. The results suggest that the Bayesian information criterion (BIC) and consistent Akaike information criterion (CAIC) are more accurate than the Akaike information criterion (AIC) in terms of model selection when the true model is simple (i.e. Poisson regression (POI)). For more complex models, such as ZIP and ZKIP, the AIC was consistently better than the BIC and CAIC, although it did not reach high levels of accuracy when sample size and the proportion of zero observations were small. The AIC tended to over-fit the data for the POI, whereas the BIC and CAIC tended to under-parameterize the data for ZIP and ZKIP. Therefore, it is desirable to study other model selection criteria for dual-inflated data with small sample size.

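Variable selection is introduced into spatial econometric models, and the variable selection problem for the spatial autoregressive model with autoregressive error terms is discussed. Under the condition that the residuals are independent and identically distributed but non-normal, a spatial information criterion is proposed by maximizing information entropy, and it is proved to be consistent for variable selection in this model. Simulation results show that the spatial information criterion identifies both individual coefficients and the full set of coefficients well, and that it has a considerable advantage over the classical Akaike criterion. The spatial information criterion is therefore a more effective variable selection method.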


In this paper, we investigate the objective function and deflation process for sparse Partial Least Squares (PLS) regression with multiple components. While many have considered variations on the objective for sparse PLS, the deflation process for sparse PLS has not received as much attention. Our work highlights a flaw in the Statistically Inspired Modification of Partial Least Squares (SIMPLS) deflation method when applied in sparse PLS regression. We also consider the Nonlinear Iterative Partial Least Squares (NIPALS) deflation in sparse PLS regression. To remedy the flaw in the SIMPLS method, we propose a new sparse PLS method wherein the direction vectors are constrained to be sparse and lie in a chosen subspace. We give insight into this new PLS procedure and show through examples and simulation studies that the proposed technique can outperform alternative sparse PLS techniques in coefficient estimation. Moreover, our analysis reveals a simple renormalization step that can be used to improve the estimation of sparse PLS direction vectors generated using any convex relaxation method.

This note discusses a problem that might occur when forward stepwise regression is used for variable selection and among the candidate variables is a categorical variable with more than two categories. Most software packages (such as SAS, SPSSx, BMDP) include special programs for performing stepwise regression. The user of these programs has to code categorical variables with dummy variables. In this case the forward selection might wrongly indicate that a categorical variable with more than two categories is nonsignificant. This is a disadvantage of the forward selection compared with the backward elimination method. A way to avoid the problem would be to test in a single step all dummy variables corresponding to the same categorical variable rather than one dummy variable at a time, such as in the analysis of covariance. This option, however, is not available in forward stepwise procedures, except for stepwise logistic regression in BMDP. A practical possibility is to repeat the forward stepwise regression and change the reference categories each time.
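The single-step test of all dummy variables suggested in this note can be illustrated with a small sketch. It is a simplification, not the note's procedure: it assumes the categorical variable is the only predictor, in which case the joint F-test of its k − 1 dummies reduces to the one-way analysis-of-variance F statistic. The function name `joint_f_for_factor` and the toy data are hypothetical.

```python
def joint_f_for_factor(groups):
    """Joint F statistic for all dummy variables of one categorical factor.

    groups maps each category label to the list of responses observed in it.
    Assumption: no other predictors are in the model, so the reduced model
    is intercept-only and the full model fits one mean per category.
    """
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # RSS of the intercept-only model (all dummies excluded)
    rss_reduced = sum((v - grand_mean) ** 2 for v in values)
    # RSS of the model with all k - 1 dummies (one mean per category)
    rss_full = sum((v - sum(g) / len(g)) ** 2
                   for g in groups.values() for v in g)
    # k - 1 numerator df (the dummies), n - k denominator df
    return ((rss_reduced - rss_full) / (k - 1)) / (rss_full / (n - k))

toy = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [6, 7, 8]}
f_joint = joint_f_for_factor(toy)
```

In this toy data the category "a" versus "b" contrast is small while "c" differs sharply from both, so a forward search that enters one dummy at a time can give a different picture depending on the reference category, whereas the joint statistic does not depend on that choice.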
