Similar Literature (20 records found)
1.
We investigated CART performance with a unimodal response curve for one continuous response and four continuous explanatory variables, where two variables were important (i.e. directly related to the response) and the other two were not. We explored performance under three relationship strengths and two explanatory-variable conditions: equal importance, and one variable four times as important as the other. We compared CART variable-selection performance using three tree-selection rules ('minimum risk', 'minimum risk complexity', 'one standard error') to stepwise polynomial ordinary least squares (OLS) under four sample-size conditions. The one-standard-error and minimum-risk-complexity rules performed about as well as stepwise OLS with large sample sizes when the relationship was strong. With weaker relationships, equally important explanatory variables and larger sample sizes, the one-standard-error and minimum-risk-complexity rules performed better than stepwise OLS. With weaker relationships and explanatory variables of unequal importance, tree-structured methods did not perform as well as stepwise OLS. Comparing performance within tree-structured methods, with a strong relationship and equally important explanatory variables, the one-standard-error rule was more likely than the other tree-selection rules to choose the correct model. The minimum-risk-complexity rule was more likely than the other tree-selection rules to choose the correct model (1) with weaker relationships and equally important explanatory variables; and (2) under all relationship strengths when explanatory variables were of unequal importance and sample sizes were lower.
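The one-standard-error tree-selection rule compared above can be illustrated with scikit-learn's cost-complexity pruning. Everything below (the data-generating model, fold count, and seeds) is an invented sketch, not the paper's actual simulation design:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1, 1, size=(n, 4))
# Unimodal response driven by the first two variables; x3 and x4 are noise
y = np.exp(-(X[:, 0] ** 2 + X[:, 1] ** 2)) + rng.normal(0, 0.1, n)

# Cost-complexity pruning gives a nested sequence of subtrees indexed by alpha
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha that prunes down to the root

# Cross-validated risk (MSE) for each candidate subtree
means, ses = [], []
for a in alphas:
    scores = -cross_val_score(
        DecisionTreeRegressor(random_state=0, ccp_alpha=a),
        X, y, cv=5, scoring="neg_mean_squared_error",
    )
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))
means, ses = np.array(means), np.array(ses)

# One-standard-error rule: the simplest tree whose risk is within one
# standard error of the minimum-risk tree
best = means.argmin()
alpha_1se = alphas[means <= means[best] + ses[best]].max()

tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha_1se).fit(X, y)
selected = set(tree.tree_.feature[tree.tree_.feature >= 0])
print(sorted(selected))  # variables actually used for splitting
```

The minimum-risk rule would instead pick `alphas[best]` directly; the one-SE rule trades a small amount of estimated risk for a simpler tree.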

2.
Based on the mutual information between the explanatory variables and the response variable, this paper proposes a new variable selection method: MI-SIS. The method can handle ultrahigh-dimensional problems in which the number of explanatory variables p is far larger than the sample size n, i.e. p = O(exp(n^ε)) for some ε > 0. Moreover, it is a model-free variable selection method, requiring no model assumptions. Numerical simulations and an empirical study show that MI-SIS can effectively detect weak signals in small samples.
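A minimal sketch of mutual-information screening in the spirit of MI-SIS, using scikit-learn's kNN-based estimator; the screening size d and the simulated data are assumptions for illustration, not taken from the paper:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
n, p = 100, 1000                       # ultrahigh dimension: p >> n
X = rng.normal(size=(n, p))
# Only the first two predictors carry signal, one of them nonlinearly
y = 2 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.5, n)

# Marginal mutual information between each predictor and the response
mi = mutual_info_regression(X, y, random_state=1)

# Keep the d predictors with the largest MI; d = ceil(n / log n) is a
# common screening size in the SIS literature (the paper's choice may differ)
d = int(np.ceil(n / np.log(n)))
kept = np.argsort(mi)[::-1][:d]
print(d, 0 in kept, 1 in kept)
```

Because mutual information is model-free, the quadratic signal in the second predictor can be picked up even though its linear correlation with the response is near zero.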

3.
This paper studies outlier detection and robust variable selection in the linear regression model. The penalized weighted least absolute deviation (PWLAD) regression estimation method and the adaptive least absolute shrinkage and selection operator (LASSO) are combined to achieve outlier detection and robust variable selection simultaneously. An iterative algorithm is proposed to solve the resulting optimization problem. Monte Carlo studies evaluate the finite-sample performance of the proposed methods. The results indicate that the proposed methods outperform existing methods in finite samples when there are leverage points or outliers in the response variable or explanatory variables. Finally, we apply the proposed methodology to analyze two real datasets.
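The full PWLAD estimator is beyond a short sketch, but its least-absolute-deviation building block can be illustrated with a minimal iteratively reweighted least squares routine; the data, iteration count, and outlier pattern below are invented for illustration:

```python
import numpy as np

def lad_irls(X, y, n_iter=50, eps=1e-6):
    """Approximate least-absolute-deviation regression via iteratively
    reweighted least squares (weight_i = 1 / max(|residual_i|, eps))."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xc, y, rcond=None)[0]     # start from OLS
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(y - Xc @ beta), eps)
        WX = Xc * w[:, None]
        beta = np.linalg.solve(Xc.T @ WX, WX.T @ y)  # weighted normal equations
    return beta

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
y[:10] += 25.0                                       # gross outliers in the response

ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
lad = lad_irls(x[:, None], y)
print("OLS:", ols, "LAD:", lad)
```

The LAD fit largely ignores the contaminated observations, while the OLS intercept is pulled upward by them; PWLAD extends this idea with observation-specific penalized weights to flag the outliers explicitly.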

4.
In many medical studies patients are nested or clustered within doctor. With many explanatory variables, variable selection with clustered data can be challenging. We propose a method for variable selection based on random forests that addresses clustered data through stratified binary splits. Our motivating example involves the detection of orthopedic device components from a large pool of candidates, where each patient belongs to a surgeon. Simulations compare the performance of survival forests grown using the stratified logrank statistic to conventional and robust logrank statistics, as well as a method to select variables using a threshold value based on a variable's empirical null distribution. The stratified logrank test performs better than the conventional and robust versions when the data are generated to have cluster-specific effects and cluster sizes are sufficiently large, and performs comparably to the splitting alternatives in the absence of cluster-specific effects. Thresholding was effective at distinguishing between important and unimportant variables.

5.
An alternative graphical method, called the SSR plot, is proposed for use with a multiple regression model. The new method uses the fact that the sum of squares for regression (SSR) of two explanatory variables can be partitioned into the SSR of one variable and the increment in SSR due to the addition of the second variable. The SSR plot represents each explanatory variable as a vector in a half circle. Explanatory variables whose vectors lie closer to the horizontal axis have stronger effects on the response variable. Furthermore, for a regression model with two explanatory variables, the magnitude of the angle between the two vectors can be used to identify suppression.
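The SSR partition that the plot is built on can be verified numerically, along with the equivalent Frisch-Waugh formulation of the increment; the toy data below are invented for illustration:

```python
import numpy as np

def fit(X, y):
    """OLS with intercept; returns fitted values."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
    return Xc @ beta

def ssr(X, y):
    """Sum of squares for regression: SST minus SSE."""
    resid = y - fit(X, y)
    return np.sum((y - y.mean()) ** 2) - np.sum(resid ** 2)

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)              # correlated predictors
y = 1.0 + 2.0 * x1 + x2 + rng.normal(size=n)

X12 = np.column_stack([x1, x2])
increment = ssr(X12, y) - ssr(x1[:, None], y)   # SSR(x2 | x1)

# Frisch-Waugh check: the increment equals the SSR from regressing the
# part of y orthogonal to x1 on the part of x2 orthogonal to x1
e_y = y - fit(x1[:, None], y)
e_2 = x2 - fit(x1[:, None], x2)
print(np.isclose(increment, ssr(e_2[:, None], e_y)))
```

This identity is exactly what lets each variable's contribution be drawn as a separate vector whose squared length adds up to the full-model SSR.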

6.
A Bayesian method for estimating a time-varying regression model subject to the presence of structural breaks is proposed. Heteroskedastic dynamics, via both GARCH and stochastic volatility specifications, and an autoregressive factor, subject to breaks, are added to generalize the standard return prediction model, in order to efficiently estimate and examine the relationship and how it changes over time. A Bayesian computational method is employed to identify the locations of structural breaks, and for estimation and inference, simultaneously accounting for heteroskedasticity and autocorrelation. The proposed methods are illustrated using simulated data. Then, an empirical study of the Taiwan and Hong Kong stock markets, using oil and gas price returns as a state variable, provides strong support for oil prices being an important explanatory variable for stock returns.

7.
Ridge regression addresses multicollinearity by introducing a biasing parameter, called the ridge parameter, which shrinks the estimates and their standard errors in order to reach acceptable results. The ridge parameter has traditionally been selected by various subjective and objective techniques, each tied to particular criteria. In this study, selection of the ridge parameter draws on additional important statistical measures to reach a better value. The proposed selection technique is based on a mathematical programming model, and the results are evaluated in a simulation study. The proposed method performs well when the error variance is greater than or equal to one, the sample consists of 20 observations, the model has 2 explanatory variables, and the two explanatory variables are very strongly correlated.
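The shrinkage behaviour of the ridge estimator (X'X + kI)^{-1}X'y under near-collinearity can be sketched directly, here with the abstract's setting of 20 observations and 2 very strongly correlated explanatory variables; the simulated data and grid of k values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # two very strongly correlated predictors
X = np.column_stack([x1, x2])
X = (X - X.mean(0)) / X.std(0)             # standardize, as is usual in ridge regression
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
y = y - y.mean()

def ridge(X, y, k):
    """Ridge estimator (X'X + kI)^{-1} X'y for ridge parameter k."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

for k in [0.0, 0.1, 1.0, 10.0]:
    b = ridge(X, y, k)
    print(f"k={k:5.1f}  coef={b}  norm={np.linalg.norm(b):.3f}")
```

At k = 0 the near-singular X'X inflates the coefficients wildly in opposite directions; as k grows, the norm of the estimate shrinks monotonically, which is the trade-off any ridge-parameter selection rule must balance.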

8.
The analysis of failure time data often involves two strong assumptions. The proportional hazards assumption postulates that hazard rates corresponding to different levels of explanatory variables are proportional. The additive effects assumption specifies that the effect associated with a particular explanatory variable does not depend on the levels of other explanatory variables. A hierarchical Bayes model is presented, under which both assumptions are relaxed. In particular, time-dependent covariate effects are explicitly modelled, and the additivity of effects is relaxed through the use of a modified neural network structure. The hierarchical nature of the model is useful in that it parsimoniously penalizes violations of the two assumptions, with the strength of the penalty being determined by the data.

9.
Joint damage in psoriatic arthritis can be measured by clinical and radiological methods, the former being applied more frequently during longitudinal follow-up of patients. Motivated by the need to compare findings based on the different methods with different observation patterns, we consider longitudinal data where the outcome variable is a cumulative total of counts that can be unobserved when other, informative, explanatory variables are recorded. We demonstrate how to calculate the likelihood for such data when it is assumed that the increment in the cumulative total follows a discrete distribution with a location parameter that depends on a linear function of explanatory variables. An approach to the incorporation of informative observation is suggested. We present analyses based on an observational database from a psoriatic arthritis clinic. Although the new statistical methodology has relatively little effect in this example, simulation studies indicate that the method can provide substantial improvements in bias and coverage in some situations where there is an important time-varying explanatory variable.

10.
Techniques of credit scoring have been developed in recent years to reduce the risk taken by banks and financial institutions on the loans they grant. Credit scoring is a problem of classifying individuals into one of two groups: defaulting borrowers or non-defaulting borrowers. The aim of this paper is to propose a new method of discrimination when the dependent variable is categorical and a large number of categorical explanatory variables are retained. This method, Categorical Multiblock Linear Discriminant Analysis, computes components that take into account both the relationships between the explanatory categorical variables and the canonical correlation between each explanatory categorical variable and the dependent variable. A comparison with three other techniques and an application to credit scoring data are provided.

11.
Over the past decades, the number of variables explaining observations in practical applications has grown steadily. This has led to heavy computational tasks, despite the wide use of provisional variable selection methods in data processing. More methodological techniques have therefore appeared to reduce the number of explanatory variables without losing much information. Within these techniques, two distinct approaches are apparent: 'shrinkage regression' and 'sufficient dimension reduction'. Surprisingly, there has been little communication or comparison between these two methodological categories, and it is not clear when each approach is appropriate. In this paper, we fill some of this gap by first reviewing each category briefly, paying special attention to its most commonly used methods. We then compare commonly used methods from both categories based on their accuracy, computation time, and ability to select effective variables. A simulation study of the performance of the methods in each category is conducted as well. The selected methods are also tested on two sets of real data, which allows us to recommend conditions under which one approach is more appropriate for high-dimensional data.

12.
闫懋博  田茂再 《统计研究》2021,38(1):147-160
The number of variables that penalized variable selection methods such as the Lasso can include in the model is limited by the sample size. Existing methods in the literature for testing the significance of variable coefficients discard the information contained in variables not selected into the model. In the high-dimensional setting where the number of variables exceeds the sample size (p > n), this paper uses a randomized bootstrap to obtain variable weights, constructs the conditional distribution of the selection event when computing the adaptive Lasso, and removes variables with non-significant coefficients to obtain the final estimate. The novelty of the proposed method is that it breaks the limit on the number of variables the adaptive Lasso can select, and that it effectively distinguishes true variables from noise variables when the observed data contain many of the latter. Simulation studies under multiple scenarios demonstrate the advantages of the proposed method over existing penalized variable selection methods on both of these problems. In the empirical study, an analysis of the NCI-60 cancer cell line data yields clear improvements over previous results in the literature.
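The adaptive-Lasso step that the method builds on can be sketched as a standard two-stage fit (initial ridge weights, then a weighted Lasso); the randomized-bootstrap weighting and post-selection significance testing of the paper are not reproduced here, and all tuning values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(4)
n, p = 50, 100                              # high dimension: p > n
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                 # three true variables, the rest noise
y = X @ beta + rng.normal(0, 0.5, n)

# Stage 1: an initial ridge fit supplies the adaptive weights w_j = 1/|b_j|
b0 = Ridge(alpha=1.0).fit(X, y).coef_
w = 1.0 / (np.abs(b0) + 1e-8)

# Stage 2: Lasso on the rescaled design X_j / w_j, then undo the rescaling
fit = Lasso(alpha=0.1, max_iter=10000).fit(X / w, y)
coef = fit.coef_ / w
selected = np.flatnonzero(coef)
print(selected)
```

Columns with small initial coefficients receive large weights and are penalized heavily, so noise variables are driven to exactly zero while the true signals survive.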

13.
Two diagnostic plots for selecting explanatory variables are introduced to assess the accuracy of a generalized beta-linear model. The added-variable plot is developed to examine the need for adding a new explanatory variable to the model. The constructed-variable plot is developed to identify nonlinearity of an explanatory variable in the model. The two diagnostic procedures are also useful for detecting unusual observations that may strongly affect the regression fit. Simulation studies and analyses of two practical examples illustrate the performance of the proposed plots.

14.
Empirical likelihood based variable selection
Information criteria form an important class of model and variable selection methods in statistical analysis. A parametric likelihood is a crucial ingredient of these methods. In some applications, such as generalized linear models, the model is specified only by a set of estimating functions. To overcome the unavailability of a well-defined likelihood function, information criteria under empirical likelihood are introduced. Under this setup, we solve the existence problem of the profile empirical likelihood caused by over-constraint in variable selection problems. The asymptotic properties of the new method are investigated, and the method is shown to be consistent in selecting the variables under mild conditions. Simulation studies find that the proposed method performs comparably to parametric information criteria when a suitable parametric model is available, and is superior when the parametric model assumption is violated. A real data set is also used to illustrate the usefulness of the new method.

15.
The calibration of forecasts for a sequence of events has an extensive literature. Since calibration does not ensure 'good' forecasts, the notion of refinement was introduced to provide a structure into which methods for comparing well-calibrated forecasters could be embedded. In this paper we apply these two concepts, calibration and refinement, to tree-structured statistical probability prediction systems by viewing predictions in terms of the expected value of a response variable given the values of a set of explanatory variables. When all of the variables are categorical, we show that, under suitable conditions, branching at the terminal node of a tree by adding another explanatory variable yields a tree with more refined predictions.

16.
Tree-based models (TBMs) can substitute missing data using the surrogate approach (SUR). The aim of this study is to compare the performance of statistical imputation against that of SUR in TBMs. Employing empirical data, a TBM was constructed. Thereafter, 10%, 20%, and 40% of the values of the variable appearing as the first split were deleted and imputed, both without and with the use of the outcome variable in the imputation model (IMP− and IMP+). This was repeated one thousand times. Absolute relative bias above 0.10 was defined as severe (SARB). Subsequently, in a series of simulations, the following parameters were varied: the degree of correlation among variables, the number of variables truly associated with the outcome, and the missing rate. At a 10% missing rate, the proportion of times SARB was observed in either SUR or IMP− was two times higher than in IMP+ (28% versus 13%). When the missing rate was increased to 20%, all these proportions approximately doubled. Irrespective of the missing rate, IMP+ was about 65% less likely than SUR to produce SARB. Results of IMP− and SUR were comparable up to a 20% missing rate; at a higher missing rate, IMP− was 76% more likely to provide SARB estimates. Statistical imputation of missing data, with the outcome variable included in the imputation model, is recommended even in the context of TBMs.
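The mechanics of the IMP+ versus IMP− comparison can be sketched with scikit-learn's IterativeImputer; the data, missingness pattern, and imputer settings below are invented, and only the setup (not the paper's full simulation) is shown:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(6)
n = 300
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, 1.0, 0.0]) + rng.normal(0, 0.5, n)

# Delete 20% of the values of the first variable (the one a tree would
# likely use for its first split, since it has the largest effect)
X_miss = X.copy()
holes = rng.choice(n, size=int(0.2 * n), replace=False)
X_miss[holes, 0] = np.nan

# IMP-: impute from the other explanatory variables only
imp_minus = IterativeImputer(random_state=0).fit_transform(X_miss)

# IMP+: add the outcome to the imputation model, then drop it again
imp_plus = IterativeImputer(random_state=0).fit_transform(
    np.column_stack([X_miss, y])
)[:, :3]

def rmse(imp):
    return np.sqrt(np.mean((imp[holes, 0] - X[holes, 0]) ** 2))

print(rmse(imp_minus), rmse(imp_plus))
```

In this toy setup the predictors are independent, so IMP− has no information about the deleted values, while IMP+ can recover them through the outcome; this is the mechanism behind the study's recommendation.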

17.
This paper deals with the problem of predicting a real-valued response variable using explanatory variables containing both a multivariate random variable and a random curve. The proposed functional partial linear single-index model treats the multivariate random variable as the linear part and the random curve as the functional single-index part. To estimate the non-parametric link function, the functional single-index and the parameters in the linear part, a two-stage estimation procedure is proposed. Compared with existing semi-parametric methods, the proposed approach requires no initial estimation and iteration. Asymptotic properties are established for both the parameters in the linear part and the functional single-index. The convergence rate for the non-parametric link function is also given. In addition, asymptotic normality of the error variance is obtained, which facilitates the construction of confidence regions and hypothesis tests for the unknown parameters. Numerical experiments, including simulation studies and a real-data analysis, are conducted to evaluate the empirical performance of the proposed method.

18.
Bayesian model building techniques are developed for data with a strong time series structure and possibly exogenous explanatory variables that have strong explanatory and predictive power. The emphasis is on determining whether, when the data have a strong time series structure, there are any explanatory variables that should also be included in the model. We use a time series model that is linear in past observations and that can capture stochastic and deterministic trend, seasonality and serial correlation. We propose plotting absolute predictive error against predictive standard deviation; a series of such plots is used to determine which of several nested and non-nested models is optimal in terms of minimizing the dispersion of the predictive distribution and restricting predictive outliers. We apply the techniques to modelling monthly counts of fatal road crashes in Australia, where economic, consumption and weather variables are available, and find that three such variables should be included in addition to the time series filter. The approach leads to graphical techniques for determining the strength of relationships between the dependent variable and covariates, detecting model inadequacy, and deriving useful numerical summaries.

19.
This paper suggests a new type of mixture regression model, in which each mixture component is explained by its own regressors. Thus, the dependent variable can be driven by one of several unobservable explanatory mechanisms, each of which has its own distinct variables. An extension of the simulated annealing algorithm is introduced to fit this general mixture model. The paper also suggests a new technique for estimating the covariance matrix of estimators in a mixture model. Finally, empirical studies of a labour supply example show that our proposed model can perform much better than conventional logistic or mixture models.

20.
The measurement error model (MEM) is important in statistics because, in a regression problem, ignoring measurement error in the explanatory variable can seriously distort statistical inference. In this paper, we revisit the MEM when both the response and explanatory variables are further subject to rounding errors. Additionally, using a normal mixture distribution for the explanatory variables increases robustness to model misspecification in measurement error regression, in line with recent developments. This paper proposes a new method for estimating the model parameters, and the resulting estimates can be proved to be consistent and asymptotically normal.
