Similar Articles
1.
Risk estimation is an important statistical question for the purposes of selecting a good estimator (i.e., model selection) and assessing its performance (i.e., estimating generalization error). This article introduces a general framework for cross-validation and derives distributional properties of cross-validated risk estimators in the context of estimator selection and performance assessment. Arbitrary classes of estimators are considered, including density estimators and predictors for both continuous and polychotomous outcomes. Results are provided for general full data loss functions (e.g., absolute and squared error, indicator, negative log density). A broad definition of cross-validation is used in order to cover leave-one-out cross-validation, V-fold cross-validation, Monte Carlo cross-validation, and bootstrap procedures. For estimator selection, finite sample risk bounds are derived and applied to establish the asymptotic optimality of cross-validation, in the sense that a selector based on a cross-validated risk estimator performs asymptotically as well as an optimal oracle selector based on the risk under the true, unknown data generating distribution. The asymptotic results are derived under the assumption that the size of the validation sets converges to infinity and hence do not cover leave-one-out cross-validation. For performance assessment, cross-validated risk estimators are shown to be consistent and asymptotically linear for the risk under the true data generating distribution and confidence intervals are derived for this unknown risk. Unlike previously published results, the theorems derived in this and our related articles apply to general data generating distributions, loss functions (i.e., parameters), estimators, and cross-validation procedures.
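A minimal sketch of the V-fold cross-validated risk estimator described above, with squared-error loss; the `fit`/`predict` hooks stand in for an arbitrary estimator, as the framework allows, and are names of my own choosing rather than the article's notation:

```python
import numpy as np

def vfold_cv_risk(X, y, fit, predict, V=5, seed=0):
    """V-fold cross-validated estimate of the squared-error risk.

    `fit(X, y)` returns a fitted object; `predict(model, X)` returns
    predictions. Any estimator can be plugged in through these hooks."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    losses = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        model = fit(X[train], y[train])
        losses.append(np.mean((y[val] - predict(model, X[val])) ** 2))
    return float(np.mean(losses))

# Example: ordinary least squares as the candidate estimator.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
risk = vfold_cv_risk(X, y,
                     fit=lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0],
                     predict=lambda beta, A: A @ beta)
print(round(risk, 2))  # close to the noise variance of 1.0
```

The same loop covers Monte Carlo cross-validation by redrawing the validation split on every repetition instead of partitioning once.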

2.
There are several procedures for fitting generalized additive models, i.e. regression models for an exponential family response where the influence of each single covariate is assumed to have an unknown, potentially non-linear shape. Simulated data are used to compare a smoothing parameter optimization approach for selection of smoothness and of covariates, a stepwise approach, a mixed model approach, and a procedure based on boosting techniques. In particular it is investigated how the performance of the procedures is linked to the amount of information, the type of response, the total number of covariates, the number of influential covariates, and the extent of non-linearity. Measures for comparison are prediction performance, identification of influential covariates, and smoothness of the fitted functions. One result is that the mixed model approach returns sparse fits with frequently over-smoothed functions, while for the boosting approach the functions are less smooth and variable selection is less strict. The other approaches lie in between with respect to these measures. The boosting procedure performs very well when little information is available and/or when a large number of covariates is to be investigated. It is somewhat surprising that in scenarios with low information the fitting of a linear model, even with stepwise variable selection, has little advantage over the fitting of an additive model when the true underlying structure is linear. In cases with more information the prediction performance of all procedures is very similar. So, in difficult data situations the boosting approach can be recommended; in others the procedure can be chosen according to the aim of the analysis.

3.
We apply statistical selection theory to multiple target detection problems by analyzing the Mahalanobis distances between multivariate normal populations and a desired standard (a known characteristic of a target). The goal is to select a subset that contains no non-target (negative) sites, which entails screening out all non-targets. Correct selection (CS) is defined according to this goal. We consider two cases: (1) all covariance matrices are known; and (2) all covariance matrices are unknown, covering both heteroscedastic and homoscedastic cases. Optimal selection procedures are proposed in order to reach the selection goal. The least favorable configurations (LFC) are found. Tables and figures are presented to illustrate the properties of the proposed procedures, and simulation examples show that they work well. Log-concavity results for the operating characteristic functions are also given.

4.
If a number of candidate variables are available, variable selection is a key task aiming to identify those candidates which influence the outcome of interest. Methods such as backward elimination and forward selection are often used, despite their drawbacks. One of these drawbacks is the instability of their results with respect to small perturbations in the data. To handle this issue, resampling-based procedures have been introduced; using a resampling technique, e.g. the bootstrap, these procedures generate several pseudo-samples that are used to compute the inclusion frequency of each variable, i.e. the proportion of pseudo-samples in which the variable is selected. Based on the inclusion frequencies, it is possible to discriminate between relevant and irrelevant variables. These procedures may fail in the case of correlated variables. To deal with this issue, two procedures based on 2×2 tables of inclusion frequencies have been developed in the literature. In this paper we analyse the behaviour of these two procedures and the role of their tuning parameters in an extensive simulation study.

5.
This paper studies subset selection procedures for screening in two-factor treatment designs that employ either a split-plot or strip-plot randomization restricted experimental design laid out in blocks. The goal is to select a subset of treatment combinations associated with the largest mean. In the split-plot design, it is assumed that the block effects, the confounding effects (whole-plot error) and the measurement errors are normally distributed. None of the selection procedures developed depend on the block variances. Subset selection procedures are given for both the case of additive and non-additive factors and for a variety of circumstances concerning the confounding effect and measurement error variances. In particular, procedures are given for (1) known confounding effect and measurement error variances; (2) unknown measurement error variance but known confounding effect; and (3) unknown confounding effect and measurement error variances. The constants required to implement the procedures are shown to be obtainable from available FORTRAN programs and tables. Generalization to the case of strip-plot randomization restriction is considered.

6.
In this article we consider a problem of selecting the best normal population that is better than a standard when the variances are unequal. Single-stage selection procedures are proposed when the variances are known. Wilcox (1984) and Taneja and Dudewicz (1992) proposed two-stage selection procedures when the variances are unknown. In addition to these procedures, we propose a two-stage selection procedure based on the method of Lam (1988). Comparisons are made between these selection procedures in terms of the sample sizes.

7.
Let X be a random n-vector whose density function is given by a mixture of known multivariate normal density functions where the corresponding mixture proportions (a priori probabilities) are unknown. We present a numerically tractable method for obtaining estimates of the mixture proportions based on the linear feature selection technique of Guseman, Peters and Walker (1975).

8.
Several procedures for ranking populations according to the quantile of a given order have been discussed in the literature. These procedures deal with continuous distributions. This paper deals with the problem of selecting a population with the largest α-quantile from k ≥ 2 finite populations, where the size of each population is known. A selection rule is given based on the sample quantiles, where the samples are drawn without replacement. A formula for the minimum probability of a correct selection under the given rule, for a certain configuration of the population α-quantiles, is given in terms of the sample sizes.
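The selection rule itself is simple to sketch: draw without replacement from each finite population and pick the one with the largest sample α-quantile. The populations and sample size below are made up for illustration:

```python
import numpy as np

def select_largest_quantile(populations, alpha, n, seed=0):
    """Select the population with the largest sample alpha-quantile, based on
    samples of size n drawn WITHOUT replacement from each finite population."""
    rng = np.random.default_rng(seed)
    sample_q = [np.quantile(rng.choice(pop, size=n, replace=False), alpha)
                for pop in populations]
    return int(np.argmax(sample_q))

# Three finite populations of known sizes; population 2 has the largest median.
rng = np.random.default_rng(3)
pops = [rng.normal(0.0, 1, 500), rng.normal(0.2, 1, 400), rng.normal(2.0, 1, 600)]
print(select_largest_quantile(pops, alpha=0.5, n=50))  # → 2
```

The paper's contribution is the probability-of-correct-selection guarantee for this rule under without-replacement sampling, which the sketch does not attempt to reproduce.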

9.
The lasso has become popular for variable selection in recent years. In this paper, lasso-type penalty functions, including the lasso and adaptive lasso, are employed for simultaneous variable selection and parameter estimation in the covariate-adjusted linear model, where the predictors and response cannot be observed directly but are distorted by an observable covariate through unknown multiplicative smooth functions. Estimation procedures are proposed and some asymptotic properties are obtained under mild conditions. It is worth noting that, under appropriate conditions, the adaptive lasso estimator correctly selects the covariates with nonzero coefficients with probability converging to one, and the estimators of the nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance; i.e. the adaptive lasso estimator has the oracle property in the sense of Fan and Li [6]. Simulation studies are carried out to examine its performance in finite sample situations and the Boston Housing data are analyzed for illustration.
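The adaptive lasso criterion can be sketched with a plain proximal-gradient (ISTA) solver. This shows only the penalized estimation step; the paper's covariate-adjusted setting would first remove the multiplicative distortion, which is omitted here:

```python
import numpy as np

def adaptive_lasso(X, y, lam=0.1, gamma=1.0, n_iter=2000):
    """Adaptive lasso via proximal gradient descent: the penalty weight on
    each coefficient is the inverse of an initial OLS estimate, so large
    coefficients are shrunk less -- the mechanism behind the oracle property."""
    n, p = X.shape
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)  # adaptive weights
    L = np.linalg.norm(X, 2) ** 2 / n              # Lipschitz constant of grad
    step = 1.0 / L
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        # soft-thresholding with coefficient-specific thresholds
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 0.0, 0.0, 1.5]) + rng.normal(size=200)
beta = adaptive_lasso(X, y)
print(np.round(beta, 2))  # near [3, 0, 0, 1.5]; the zero coefficients drop out
```

Because the weights are large exactly where the initial estimate is near zero, irrelevant coefficients are thresholded to exact zeros while the signal coefficients incur only negligible shrinkage.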

10.
Non-random sampling is a source of bias in empirical research. It is common for the outcomes of interest (e.g. the wage distribution) to be skewed in the source population. Sometimes, the outcomes are further subjected to sample selection, which is a type of missing data, resulting in partial observability. Thus, methods based on complete cases for skewed data are inadequate for the analysis of such data, and a general sample selection model is required. Heckman proposed a full maximum likelihood estimation method under the normality assumption for sample selection problems, and parametric and non-parametric extensions have been proposed. We generalize the Heckman selection model to allow for underlying skew-normal distributions. Finite-sample performance of the maximum likelihood estimator of the model is studied via simulation. Applications illustrate the strength of the model in capturing spurious skewness in bounded scores, and in modelling data where a logarithm transformation could not mitigate the effect of inherent skewness in the outcome variable.
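The classical Heckman correction under joint normality, the baseline that this paper generalizes to skew-normal errors, can be sketched with the familiar two-step estimator (probit for selection, then OLS augmented with the inverse Mills ratio). The simulated design is my own illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def heckman_two_step(Xs, s, Xo, y):
    """Heckman's two-step estimator. Step 1: probit for the selection
    indicator s on Xs. Step 2: OLS of the observed outcomes on Xo plus the
    inverse Mills ratio, which corrects the selection bias under normality."""
    def probit_nll(g):
        p = np.clip(norm.cdf(Xs @ g), 1e-10, 1 - 1e-10)
        return -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
    g = minimize(probit_nll, np.zeros(Xs.shape[1]), method="BFGS").x
    mills = norm.pdf(Xs @ g) / norm.cdf(Xs @ g)   # inverse Mills ratio
    Z = np.column_stack([Xo, mills])[s == 1]
    return np.linalg.lstsq(Z, y[s == 1], rcond=None)[0]

# Simulated data: the outcome is observed only when a correlated latent
# selection index is positive, so naive OLS on the observed cases is biased.
rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
w = rng.normal(size=n)                             # selection-only covariate
u, e = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=n).T
s = (0.5 + x + w + u > 0).astype(float)
y = 1.0 + 2.0 * x + e
coef = heckman_two_step(np.column_stack([np.ones(n), x, w]), s,
                        np.column_stack([np.ones(n), x]), y)
print(np.round(coef[:2], 2))  # close to the true intercept 1 and slope 2
```

The paper's skew-normal generalization replaces the normal error law in the full likelihood; the two-step version above is only the normal-theory baseline.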

11.
Inverse regression estimation for censored data
An inverse regression methodology for assessing predictor performance in the censored data setup is developed along with inference procedures and a computational algorithm. The technique developed here allows for conditioning on the unobserved failure time along with a weighting mechanism that accounts for the censoring. The implementation is nonparametric and computationally fast. This provides an efficient methodological tool that can be used especially in cases where the usual modeling assumptions are not applicable to the data under consideration. It can also be a good diagnostic tool that can be used in the model selection process. We have provided theoretical justification of consistency and asymptotic normality of the methodology. Simulation studies and two data analyses are provided to illustrate the practical utility of the procedure.

12.
In a two-treatment trial, a two-sided test is often used to reach a conclusion. Usually we are interested in a two-sided test because there is no prior preference between the two treatments and a three-decision framework is wanted. When a standard control is just as good as the new experimental treatment (which has the same toxicity and cost), we accept both treatments. Only when the standard control is clearly better or worse than the new experimental treatment do we choose a single treatment. In this paper, we extend the concept of a two-sided test to the multiple-treatment trial where three or more treatments are involved. The procedure turns out to be a subset selection procedure; however, the theoretical framework and performance requirement differ from those of existing subset selection procedures. Two procedures (exclusion and inclusion) are developed here for the case of normal data with equal known variance. If the sample size is large, they can be applied with unknown variance and with binomial data or survival data with random censoring.

13.
Our goal is to find a regression technique that can be used in a small-sample situation with possible model misspecification. The development of a new bandwidth selector allows nonparametric regression (in conjunction with least squares) to be used in this small-sample problem, where nonparametric procedures have previously proven to be inadequate. Considered here are two new semiparametric (model-robust) regression techniques that combine parametric and nonparametric techniques when there is partial information present about the underlying model. A general overview is given of how typical concerns for bandwidth selection in nonparametric regression extend to the model-robust procedures. A new penalized PRESS criterion (with a graphical selection strategy for applications) is developed that overcomes these concerns and is able to maintain the beneficial mean squared error properties of the new model-robust methods. It is shown that this new selector outperforms standard and recently improved bandwidth selectors. Comparisons of the selectors are made via numerous generated data examples and a small simulation study.
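The role PRESS plays in bandwidth selection can be sketched for a plain Nadaraya-Watson smoother. The paper's selector adds a penalty to this criterion, which is omitted in this sketch; only the unpenalized leave-one-out machinery is shown:

```python
import numpy as np

def press_bandwidth(x, y, bandwidths):
    """Bandwidth selection for Nadaraya-Watson kernel regression by the PRESS
    (leave-one-out prediction error) criterion. For a linear smoother
    y_hat = S y, the LOO residual has the closed form
    (y_i - y_hat_i) / (1 - S_ii), so no refitting is needed."""
    best_h, best_press = None, np.inf
    for h in bandwidths:
        K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)  # Gaussian kernel
        S = K / K.sum(axis=1, keepdims=True)                     # smoother matrix
        resid = (y - S @ y) / (1.0 - np.diag(S))
        press = np.sum(resid ** 2)
        if press < best_press:
            best_h, best_press = h, press
    return best_h

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=100)
h = press_bandwidth(x, y, np.linspace(0.01, 0.5, 25))
print(h)  # a moderate bandwidth: tiny h inflates variance, large h inflates bias
```

The small-sample concern the paper addresses is visible in the `1 - S_ii` factor: for very small bandwidths the self-weight `S_ii` approaches one and plain PRESS becomes unstable, which motivates penalizing it.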

14.
A technique for selection procedures, called sequential rejection, is investigated. It is shown that this technique can be applied to certain selection goals of the "all or nothing" type, i.e. "selecting a subset containing all good populations" or "selecting a subset containing no bad population". The analogy with existing sequential techniques in the general theory of simultaneous statistical inference is pointed out.

15.
Over the last decade, various simulation-based nonlinear and non-Gaussian filters and smoothers have been proposed. When unknown parameters are included in the nonlinear and non-Gaussian system, however, it is very difficult to estimate the parameters together with the state variables, because the state-space model generally includes many parameters and the simulation-based procedures are subject to simulation or sampling errors. Precise estimates of the parameters therefore cannot easily be obtained (i.e., the obtained estimates may not be the global optima). In this paper, an attempt is made to estimate the state variables and the unknown parameters simultaneously, where a Monte Carlo optimization procedure is adopted for maximization of the likelihood function.
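The difficulty described above can be illustrated with a bootstrap particle filter whose log-likelihood is maximized over a parameter grid. The linear-Gaussian toy model and the grid search are my own illustration, not the paper's procedure; note the fixed seed (common random numbers), without which the simulation noise in the likelihood makes the maximizer erratic:

```python
import numpy as np

def pf_loglik(y, phi, sigma_w=1.0, sigma_v=1.0, N=500, seed=0):
    """Bootstrap particle filter log-likelihood for the state-space model
    x_t = phi * x_{t-1} + w_t,  y_t = x_t + v_t (Gaussian noises)."""
    rng = np.random.default_rng(seed)          # common random numbers across phi
    x = rng.normal(0.0, sigma_w, N)
    ll = 0.0
    for yt in y:
        x = phi * x + rng.normal(0.0, sigma_w, N)        # propagate particles
        w = np.exp(-0.5 * ((yt - x) / sigma_v) ** 2)     # observation weights
        ll += np.log(np.mean(w) / (sigma_v * np.sqrt(2.0 * np.pi)))
        x = x[rng.choice(N, size=N, p=w / w.sum())]      # multinomial resampling
    return ll

rng = np.random.default_rng(7)
true_phi, T = 0.8, 200
xt, y = 0.0, []
for _ in range(T):
    xt = true_phi * xt + rng.normal()
    y.append(xt + rng.normal())
y = np.array(y)
grid = np.linspace(0.5, 0.95, 10)
phi_hat = grid[np.argmax([pf_loglik(y, p) for p in grid])]
print(round(phi_hat, 2))  # should land near the true value 0.8
```

Filtering and parameter estimation happen in one pass: each likelihood evaluation also produces the particle approximation of the states.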

16.
Structured sparsity has recently become a very popular technique for dealing with high-dimensional data. In this paper, we focus on theoretical properties of the overlapping group structure in generalized linear models (GLMs). Although the overlapping group lasso method for GLMs has been widely applied, its theoretical properties are still unknown. Under some general conditions, we present oracle inequalities for the estimation and prediction error of the overlapping group lasso method in the generalized linear model setting. We then apply these results to the logistic and Poisson regression models. It is shown that the results for the lasso and group lasso procedures for GLMs can be recovered by specifying the group structures in our proposed method. The effect of overlap and the variable selection performance of the proposed method are both studied by numerical simulations. Finally, we apply the proposed method to two gene expression data sets: the p53 data and the lung cancer data.
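The computational core of any group lasso method is the groupwise soft-thresholding (proximal) step sketched below. Handling overlap by duplicating the shared coordinates (the latent-variable formulation) is one common reduction and an assumption of this sketch, not the paper's derivation:

```python
import numpy as np

def group_soft_threshold(beta, groups, tau):
    """Proximal operator of the group-lasso penalty: each group's coefficient
    block is shrunk toward zero, and set EXACTLY to zero when its Euclidean
    norm falls below the threshold tau -- groupwise variable selection."""
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= tau else (1 - tau / norm) * beta[g]
    return out

beta = np.array([3.0, 4.0, 0.1, -0.1, 2.0])
groups = [[0, 1], [2, 3], [4]]
shrunk = group_soft_threshold(beta, groups, tau=1.0)
print(shrunk)
# group [0, 1] (norm 5) shrinks to [2.4, 3.2]; group [2, 3] (norm ~0.14) is
# zeroed out; the singleton group behaves like ordinary soft-thresholding
```

With overlapping groups, each original coefficient is recovered as the sum of its duplicated latent copies, which is what makes whole groups, rather than arbitrary coordinates, drop out.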

17.
In this paper, statistical inferences for the size-biased Weibull distribution are drawn in two different cases. In the first case, where the size r of the bias is considered known, it is proven that the maximum-likelihood estimators (MLEs) always exist. In the second case, where the size r is considered an unknown parameter, the estimating equations for the MLEs are presented and the Fisher information matrix is found. Estimation by the method of moments can be used in case the MLEs do not exist. The advantage of treating r as an unknown parameter is that it allows us to perform tests concerning the existence of size-bias in the sample. Finally, a program in Mathematica is provided that implements all the statistical procedures developed in this paper.
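A numerical sketch of the known-r case: the size-biased Weibull density is f_r(x) = x^r f(x) / E[X^r] with E[X^r] = λ^r Γ(1 + r/k), and the MLEs can be found by direct optimization. The weighted-resampling device used to simulate size-biased data is an illustrative shortcut, not the paper's sampling scheme:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def sb_weibull_nll(params, x, r):
    """Negative log-likelihood of the size-biased Weibull of known order r,
    with shape k and scale lam optimized on the log scale for positivity."""
    k, lam = np.exp(params)
    logf = (np.log(k) - np.log(lam) + (k - 1) * np.log(x / lam)
            - (x / lam) ** k)                    # ordinary Weibull log-density
    logmu = r * np.log(lam) + gammaln(1 + r / k) # log E[X^r]
    return -np.sum(r * np.log(x) + logf - logmu)

# Simulate size-biased data of order r = 1 by resampling a large Weibull pool
# with probability proportional to x (an approximation for illustration).
rng = np.random.default_rng(8)
pool = rng.weibull(2.0, 200000) * 3.0            # Weibull, shape 2, scale 3
x = rng.choice(pool, size=2000, p=pool / pool.sum())
res = minimize(sb_weibull_nll, np.log([1.0, 1.0]), args=(x, 1),
               method="Nelder-Mead")
k_hat, lam_hat = np.exp(res.x)
print(round(k_hat, 1), round(lam_hat, 1))  # near the true shape 2 and scale 3
```

Ignoring the size-bias term (fitting a plain Weibull to `x`) would noticeably overestimate the scale, which is the point of modeling the bias explicitly.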

18.
Selection of the “best” t out of k populations has been considered in the indifference zone formulation by Bechhofer (1954) and in the subset selection formulation by Carroll, Gupta and Huang (1975). The latter approach is used here to obtain conservative solutions for the goals of selecting (i) all the “good” or (ii) only “good” populations, where “good” means having a location parameter among the t largest. For the case of normal distributions with common unknown variance, tables are produced for implementing these procedures. Also, for this case, simulation results suggest that the procedure may not be too conservative.

19.
We propose a new criterion for model selection in prediction problems. The covariance inflation criterion adjusts the training error by the average covariance of the predictions and responses, when the prediction rule is applied to permuted versions of the data set. This criterion can be applied to general prediction problems (e.g. regression or classification) and to general prediction rules (e.g. stepwise regression, tree-based models and neural nets). As a by-product we obtain a measure of the effective number of parameters used by an adaptive procedure. We relate the covariance inflation criterion to other model selection procedures and illustrate its use in some regression and classification problems. We also revisit the conditional bootstrap approach to model selection.
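The permutation-based covariance adjustment can be sketched as follows; the OLS prediction rule and the simulated design are illustrative choices, and the constants in the adjusted criterion are a sketch rather than taken from the paper:

```python
import numpy as np

def covariance_inflation(X, y, fit_predict, B=50, seed=0):
    """Average covariance between fitted values and responses when the
    prediction rule is refit on permuted versions of the responses. Adding a
    multiple of this quantity to the training error gives the covariance
    inflation adjustment; its size reflects the rule's effective complexity."""
    rng = np.random.default_rng(seed)
    covs = []
    for _ in range(B):
        yp = rng.permutation(y)
        yhat = fit_predict(X, yp)
        covs.append(np.mean((yhat - yhat.mean()) * (yp - yp.mean())))
    return float(np.mean(covs))

def ols_fit_predict(X, y):
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(9)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 9))])
y = X[:, 1] + rng.normal(size=n)
adj_small = covariance_inflation(X[:, :2], y, ols_fit_predict)  # 2 columns
adj_full = covariance_inflation(X, y, ols_fit_predict)          # 10 columns
print(round(adj_small, 3), round(adj_full, 3))
# the adjustment grows with the number of fitted parameters, giving the
# effective-complexity measure mentioned in the abstract
```

Because only `fit_predict` is touched, the same loop applies unchanged to stepwise regression, trees, or neural networks.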

20.
A procedure for estimating the location parameter of an unknown symmetric distribution is developed for application to samples from very light-tailed through very heavy-tailed distributions. This procedure has an easy extension to a technique for estimating the coefficients in a linear regression model whose error distribution is symmetric with arbitrary tail weights. The regression procedure is, in turn, extended to make it applicable to situations where the error distribution is either symmetric or skewed. The potentials of the procedures for robust location parameter and regression coefficient estimation are demonstrated by simulation studies.
