
1.

Frequency of Selecting Noise Variables in Subset Regression Analysis: A Simulation Study

Virginia F. Flack Potter C. Chang 《The American statistician》2013,67(1):84-86

This article presents the results of a simulation study of variable selection in a multiple regression context. The study evaluates the frequency of selecting noise variables and the bias of the adjusted R² of the selected variables when some of the candidate variables are authentic. It is demonstrated that for most samples a large percentage of the selected variables is noise, particularly when the number of candidate variables is large relative to the number of observations. The adjusted R² of the selected variables is highly inflated.

2.

Model selection procedures in social research: Monte-Carlo simulation results

Lawrence E. Raffalovich Glenn D. Deane David Armstrong Hui-Shien Tsao 《Journal of applied statistics》2008,35(10):1093-1114

Model selection strategies play an important, if not explicit, role in quantitative research. The inferential properties of these strategies are largely unknown; therefore, there is little basis for recommending (or avoiding) any particular set of strategies. In this paper, we evaluate several commonly used model selection procedures [Bayesian information criterion (BIC), adjusted R², Mallows’ C_p, Akaike information criterion (AIC), AIC_c, and stepwise regression] using Monte-Carlo simulation of model selection when the true data generating processes (DGP) are known.

We find that the ability of these selection procedures to include important variables and exclude irrelevant variables increases with the size of the sample and decreases with the amount of noise in the model. None of the model selection procedures does well in small samples, even when the true DGP is largely deterministic; thus, data mining in small samples should be avoided entirely. Instead, the implicit uncertainty in model specification should be explicitly discussed. In large samples, BIC is better than the other procedures at correctly identifying most of the generating processes we simulated, and stepwise does almost as well. In the absence of strong theory, both BIC and stepwise appear to be reasonable model selection strategies in large samples. Under the conditions simulated, adjusted R², Mallows’ C_p, AIC, and AIC_c are clearly inferior and should be avoided.

3.

Bayesian inference on P(X > Y) in bivariate Rayleigh model

Abbas Pak Arjun Kumar Gupta 《Communications in Statistics - Theory and Methods》2018,47(17):4095-4105

In the literature, statistical estimation of the stress–strength parameter R = P(X > Y) has been intensively investigated under the assumption that the random variables X and Y are independent. However, in some real applications, the strength variable X could be highly dependent on the stress variable Y. In this paper, unlike the common practice in the literature, we discuss estimation of the parameter R in the more realistic case where X and Y are dependent random variables following a bivariate Rayleigh model. We derive the Bayes estimates and highest posterior density credible intervals of the parameters using suitable priors on the parameters. Because there are no closed forms for the Bayes estimates, we use an approximation based on the Laplace method and a Markov chain Monte Carlo technique to obtain the Bayes estimates of R and the unknown parameters. Finally, simulation studies are conducted to evaluate the performance of the proposed estimators, and analyses of two data sets are provided.

4.

Multi-step quantile regression tree

《Journal of Statistical Computation and Simulation》2012,82(3):663-682

Quantile regression (QR), proposed by Koenker and Bassett [Regression quantiles, Econometrica 46(1) (1978), pp. 33–50], is a statistical technique that estimates conditional quantiles. It has been widely studied and applied in economics. Meinshausen [Quantile regression forests, J. Mach. Learn. Res. 7 (2006), pp. 983–999] proposed quantile regression forests (QRF), a non-parametric method based on random forests. QRF performs well in terms of prediction accuracy, but it struggles with noisy data sets. This motivates us to propose a multi-step QR tree method using GUIDE (Generalized, Unbiased, Interaction Detection and Estimation), developed by Loh [Regression trees with unbiased variable selection and interaction detection, Statist. Sinica 12 (2002), pp. 361–386]. Our simulation study shows that the multi-step QR tree performs better than a single tree or QRF, especially on data sets with many irrelevant variables.

5.

Instrumental variable estimation in ordinal probit models with mismeasured predictors

Jing Guan Hongjian Cheng Kenneth A. Bollen D. Roland Thomas Liqun Wang 《Revue canadienne de statistique》2019,47(4):653-667

Researchers in the medical, health, and social sciences routinely encounter ordinal variables such as self‐reports of health or happiness. When modelling ordinal outcome variables, it is common to have covariates (for example, attitudes, family income, or retrospective variables) that are measured with error. As is well known, ignoring even random error in covariates can bias coefficients and hence prejudice the estimates of effects. We propose an instrumental variable approach to the estimation of a probit model with an ordinal response and mismeasured predictor variables. We obtain likelihood‐based and method of moments estimators that are consistent and asymptotically normally distributed under general conditions. In our simulation studies, these estimators are easy to compute, perform well, and are robust against the normality assumption for the measurement errors. The proposed method is applied to both simulated and real data. The Canadian Journal of Statistics 47: 653–667; 2019 © 2019 Statistical Society of Canada

6.

SimSel: a new simulation method for variable selection

《Journal of Statistical Computation and Simulation》2012,82(4):515-527

We propose a new simulation method, SimSel, for variable selection in linear and nonlinear modelling problems. SimSel works by disturbing the input data with pseudo-errors. We then study how this disturbance affects the quality of an approximative model fitted to the data. The main idea is that disturbing unimportant variables does not affect the quality of the model fit. The use of an approximative model has the advantage that the true underlying function does not need to be known and that the method becomes insensitive to model misspecifications. We demonstrate SimSel on simulated data from linear and nonlinear models and on two real data sets. The simulation studies suggest that SimSel works well in complicated situations, such as nonlinear errors-in-variable models.

7.

Bayesian variable selection for multioutcome models through shared shrinkage

Debamita Kundu Riten Mitra Jeremy T. Gaskins 《Scandinavian Journal of Statistics》2021,48(1):295-320

Variable selection over a potentially large set of covariates in a linear model is quite popular. In the Bayesian context, common prior choices can lead to a posterior expectation of the regression coefficients that is a sparse (or nearly sparse) vector whose few nonzero components correspond to the most important covariates. This article extends the “global‐local” shrinkage idea to a scenario where one wishes to model multiple response variables simultaneously. Here, we have developed a variable selection method for a K‐outcome model (multivariate regression) that identifies the most important covariates across all outcomes. The prior for all regression coefficients is a mean zero normal with a coefficient‐specific variance term that consists of a predictor‐specific factor (shared local shrinkage parameter) and a model‐specific factor (global shrinkage term) that differs in each model. The performance of our modeling approach is evaluated through simulation studies and a data example.

8.

Bayesian variable selection in a finite mixture of linear mixed-effects models

Kuo-Jung Lee Ray-Bing Chen 《Journal of Statistical Computation and Simulation》2019,89(13):2434-2453

Mixtures of linear mixed-effects models have received considerable attention in longitudinal studies, including medical research, social science and economics. The inferential question of interest is often the identification of critical factors that affect the responses. We consider a Bayesian approach to select the important fixed and random effects in the finite mixture of linear mixed-effects models. To accomplish our goal, latent variables are introduced to facilitate the identification of influential fixed and random components and to classify the membership of observations in the longitudinal data. A spike-and-slab prior for the regression coefficients is adopted to sidestep the potential complications of highly collinear covariates and to handle large p and small n issues in the variable selection problems. Here we employ Markov chain Monte Carlo (MCMC) sampling techniques for posterior inferences and explore the performance of the proposed method in simulation studies, followed by an actual psychiatric data analysis concerning depressive disorder.

9.

A novel bagging approach for variable ranking and selection via a mixed importance measure

Chun-Xia Zhang Jiang-She Zhang Guan-Wei Wang Nan-Nan Ji 《Journal of applied statistics》2018,45(10):1734-1755

Ensemble learning has shown great power in stabilizing and enhancing the performance of some traditional variable selection methods such as the lasso and genetic algorithms. In this paper, a novel bagging ensemble method called BSSW is developed to implement variable ranking and selection in linear regression models. Its main idea is to execute a stepwise search algorithm on multiple bootstrap samples. In each trial, a mixed importance measure is assigned to each variable according to the order in which it is selected into the final model as well as the improvement in model fit resulting from its inclusion. Based on the importance measure averaged across the bootstrap trials, all candidate variables are ranked and then classified as important or not. To widen its scope of application, BSSW is extended to generalized linear models. Experiments carried out with simulated and real data indicate that BSSW achieves better performance in most studied cases when compared with several other existing methods.

10.

The Prediction Sum of Squares as a General Measure for Regression Diagnostics

Nguyen T. Quan 《Journal of Business & Economic Statistics》2013,31(4):501-504

Statistics that usually accompany the regression model do not provide insight into the quality of the data or the potential influence of the individual observations on the estimates. In this study, the Q² statistic is used as a criterion for detecting influential observations or outliers. The statistic is derived from the jackknifed residuals, the sum of squares of which is generally known as the prediction sum of squares or PRESS. This article compares R² with Q² and suggests that the latter be used as part of the data-quality check. It is shown, for two separate data sets obtained from regional cost of living and U.S. food industry studies, that in the presence of outliers the Q² statistic can be negative, because it is sensitive to the choice of regressors and the inclusion of influential observations. Once the outliers are dropped from the sample, the discrepancy between Q² and R² values is negligible.
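The relationship between PRESS, Q² and R² described in this abstract can be illustrated with a minimal sketch. It assumes the common definition Q² = 1 − PRESS/TSS and ordinary least squares with an intercept; the function name `r2_and_q2`, the simulated data, and the size of the planted outlier are hypothetical choices, not taken from the article.

```python
import numpy as np

def r2_and_q2(X, y):
    """Return (R2, Q2) for an OLS fit of y on X with an intercept.

    Q2 is taken here as 1 - PRESS/TSS (an assumed, commonly used definition).
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    resid = y - A @ beta
    # Leverages h_ii are the diagonal of the hat matrix; with A = QR they
    # equal the squared row norms of Q.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    # Leave-one-out (jackknifed) residuals without refitting n times.
    press = np.sum((resid / (1.0 - h)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid**2) / tss
    q2 = 1.0 - press / tss
    return r2, q2

# Illustrative data: a clean linear relationship, then one planted outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 1))
y_clean = 2.0 + 3.0 * x[:, 0] + rng.normal(size=30)
y_out = y_clean.copy()
y_out[0] += 25.0                                   # hypothetical gross outlier

r2_clean, q2_clean = r2_and_q2(x, y_clean)
r2_out, q2_out = r2_and_q2(x, y_out)
print(f"clean:   R2={r2_clean:.3f}  Q2={q2_clean:.3f}")
print(f"outlier: R2={r2_out:.3f}  Q2={q2_out:.3f}")
```

Because each leave-one-out residual is at least as large in magnitude as the ordinary residual, PRESS is never smaller than the residual sum of squares, so Q² never exceeds R² under this definition.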

11.

Investigation about a screening step in model selection

Willi Sauerbrei Norbert Holländer Anika Buchholz 《Statistics and Computing》2008,18(2):195-208

In many studies a large number of variables is measured, and the identification of relevant variables influencing an outcome is an important task. Several procedures are available for variable selection. However, focusing on one model only neglects that there usually exist other equally appropriate models. Bayesian or frequentist model averaging approaches have been proposed to improve the development of a predictor. With a larger number of variables (say more than ten variables) the resulting class of models can be very large. For Bayesian model averaging, Occam’s window is a popular approach to reduce the model space. As this approach may not eliminate any variables, a variable screening step was proposed for a frequentist model averaging procedure. Based on the results of selected models in bootstrap samples, variables are eliminated before deriving a model averaging predictor. As a simple alternative screening procedure, backward elimination can be used. Through two examples and by means of simulation we investigate some properties of the screening step. In the simulation study we consider situations with 15 and 25 variables, respectively, of which seven have an influence on the outcome. With the screening step most of the uninfluential variables will be eliminated, but also some variables with a weak effect. Variable screening leads to more applicable models without eliminating models that are more strongly supported by the data. Furthermore, we give recommendations for important parameters of the screening step.

12.

Variable selection in the high-dimensional continuous generalized linear model with current status data

Guo-Liang Tian Lixin Song 《Journal of applied statistics》2014,41(3):467-483

In survival studies, current status data are frequently encountered when some individuals in a study are not successively observed. This paper considers the problem of simultaneous variable selection and parameter estimation in the high-dimensional continuous generalized linear model with current status data. We apply the penalized likelihood procedure with the smoothly clipped absolute deviation penalty to select significant variables and estimate the corresponding regression coefficients. With a proper choice of tuning parameters, the resulting estimator is shown to be a √(n/p_n)-consistent estimator under some mild conditions. In addition, we show that the resulting estimator has the same asymptotic distribution as the estimator obtained when the true model is known. The finite sample behavior of the proposed estimator is evaluated through simulation studies and a real example.

13.

Pseudo latent models: Goodness of fit measures, residuals, estimation, testing, and simulation

Olaf Hübler 《Statistical Papers》1997,38(3):271-285

Binary response models consider pseudo-R² measures which are not based on residuals, while several concepts of residuals were developed for tests. In this paper the endogenous variable of the latent model corresponding to the binary observable model is substituted by a pseudo variable. Then goodness of fit measures and tests can be based on a joint concept of residuals as for linear models. Different kinds of residuals based on probit ML estimates are employed. The analytical investigations and the simulation results lead to the recommendation to use standardized residuals where there is no difference between observed and generalized residuals. In none of the investigated situations is this estimator far from the best result. While in large samples all considered estimators are very similar, small sample properties speak in favour of residuals which are modifications of those suggested in the literature. An empirical application demonstrates that it is not necessary to develop new testing procedures for the observable models with dichotomous regressands. Well-known approaches for linear models with continuous endogenous variables, which are implemented in the usual econometric packages, can be used for pseudo latent models. An erratum to this article is available at .

14.

The estimation of R² and adjusted R² in incomplete data sets using multiple imputation

Ofer Harel 《Journal of applied statistics》2009,36(10):1109-1118

The coefficient of determination, also known as R², is a common measure in regression analysis. Many scientists use the R² and the adjusted R² on a regular basis. In most cases, the researchers treat the coefficient of determination as an index of ‘usefulness’ or ‘goodness of fit,’ and in some cases, they even treat it as a model selection tool. In cases in which the data are incomplete, most researchers and common statistical software will use complete case analysis in order to estimate the R², a procedure that might lead to biased results. In this paper, I introduce the use of multiple imputation for the estimation of R² and adjusted R² in incomplete data sets. I illustrate my methodology using a biomedical example.
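For reference, the two quantities named in this abstract have the following standard textbook definitions for a linear regression with an intercept. The symbols are assumptions introduced here, not notation from the article: n is the number of observations, p the number of predictors excluding the intercept, y_i the observed responses, ŷ_i the fitted values, and ȳ the sample mean of the responses.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right) \frac{n - 1}{n - p - 1}.
```

The adjustment penalizes additional predictors, so the adjusted value can fall when a predictor that contributes little is added, whereas R² itself never decreases.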

15.

A Note on Screening Regression Equations

David A. Freedman 《The American statistician》2013,67(2):152-155

Consider developing a regression model in a context where substantive theory is weak. To focus on an extreme case, suppose that in fact there is no relationship between the dependent variable and the explanatory variables. Even so, if there are many explanatory variables, the R² will be high. If explanatory variables with small t statistics are dropped and the equation refitted, the R² will stay high and the overall F will become highly significant. This is demonstrated by simulation and by asymptotic calculation.
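The screening effect described in this abstract can be reproduced in a few lines. The following is a minimal sketch rather than the article's own simulation: the sample size, the number of noise predictors, the |t| cutoff of about 1.15 (roughly a 25% two-sided level), and the helper name `ols_summary` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50                        # assumed sizes: observations, noise predictors
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                # generated independently of X

def ols_summary(X, y):
    """OLS with intercept: return (R2, overall F, t statistics of the slopes)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(A.T @ A)
    beta = XtX_inv @ A.T @ y
    resid = y - A @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    df = n - k - 1                                  # residual degrees of freedom
    se = np.sqrt(rss / df * np.diag(XtX_inv))       # coefficient standard errors
    r2 = 1.0 - rss / tss
    f_stat = (r2 / k) / ((1.0 - r2) / df)           # overall F statistic
    return r2, f_stat, (beta / se)[1:]              # drop the intercept's t

# First pass: all noise predictors.
r2_full, f_full, t_full = ols_summary(X, y)

# Screening: keep only predictors whose |t| exceeds the assumed cutoff.
keep = np.abs(t_full) > 1.15
r2_sel, f_sel, _ = ols_summary(X[:, keep], y)

print(f"full model:     k={p:2d}  R2={r2_full:.3f}  F={f_full:.2f}")
print(f"screened model: k={keep.sum():2d}  R2={r2_sel:.3f}  F={f_sel:.2f}")
```

Changing the seed gives different numbers, but comparing the two printed F statistics against their reference distributions shows how screening on the same data distorts the usual significance test.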

16.

Doubly sparse regression incorporating graphical structure among predictors

Matthew Stephenson R. Ayesha Ali Gerarda A. Darlington 《Revue canadienne de statistique》2019,47(4):729-747

Recent research has demonstrated that information learned from building a graphical model on the predictor set of a regularized linear regression model can be leveraged to improve prediction of a continuous outcome. In this article, we present a new model that encourages sparsity at both the level of the regression coefficients and the level of individual contributions in a decomposed representation. This model provides parameter estimates with a finite sample error bound and exhibits robustness to errors in the input graph structure. Through a simulation study and the analysis of two real data sets, we demonstrate that our model provides a predictive benefit when compared to previously proposed models. Furthermore, it is a highly flexible model that provides a unified framework for the fitting of many commonly used regularized regression models. The Canadian Journal of Statistics 47: 729–747; 2019 © 2019 Statistical Society of Canada

17.

Quantifying R² bias in the presence of measurement error

Karl D. Majeske Terri Lynch-Caris Janet Brelin-Fornari 《Journal of applied statistics》2010,37(4):667-677


18.

Model-averaged ℓ1 regularization using Markov chain Monte Carlo model composition

《Journal of Statistical Computation and Simulation》2012,82(6):1090-1101

Bayesian model averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ℓ₁ regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ℓ₁ regularization path as a model space, we propose a method to resolve the model uncertainty issues arising in model averaging from solution path point selection. We show that this method is computationally and empirically effective for regression and classification in high-dimensional data sets. We apply our technique in simulations, as well as to some applications that arise in genomics.

19.

The Loss Rank Criterion for Variable Selection in Linear Regression Analysis

MINH‐NGOC TRAN 《Scandinavian Journal of Statistics》2011,38(3):466-479

Lasso and other regularization procedures are attractive methods for variable selection, subject to a proper choice of shrinkage parameter. Given a set of potential subsets produced by a regularization algorithm, a consistent model selection criterion is proposed to select the best one among this preselected set. The approach leads to a fast and efficient procedure for variable selection, especially in high‐dimensional settings. Model selection consistency of the suggested criterion is proven when the number of covariates d is fixed. Simulation studies suggest that the criterion still enjoys model selection consistency when d is much larger than the sample size. The simulations also show that our approach for variable selection works surprisingly well in comparison with existing competitors. The method is also applied to a real data set.

20.

A variable selection method for detecting abnormality based on the T² test

N. Shinozaki T. Iida 《Communications in Statistics - Theory and Methods》2017,46(17):8603-8617

This paper proposes a variable selection method for detecting abnormal items based on the T² test when observations on abnormal items are available. Based on unbiased estimates of the power for all subsets of variables, the method selects the subset of variables that maximizes the power estimate. Since more than one subset of variables frequently maximizes the power estimate, the averaged p-value of the rejected items is used as a second criterion. Although the performance of the method depends on the sample size for the abnormal items and the true power values for all subsets of variables, numerical experiments show the effectiveness of the proposed method. Since normal and abnormal items are simulated using one-factor and two-factor models, basic properties of the power functions for the models are investigated.