Similar Documents
20 similar documents found (search time: 765 ms)
1.
Penalized logistic regression is a useful tool for classifying samples and selecting features. Although the methodology has been widely used in various fields of research, its performance deteriorates sharply in the presence of outliers, since logistic regression is based on the maximum log-likelihood method, which is sensitive to outliers. This means we can neither accurately classify samples nor find the important factors that carry crucial information for classification. To overcome the problem, we propose a robust penalized logistic regression based on a weighted likelihood methodology. We also derive an information criterion for choosing the tuning parameters, a vital matter in robust penalized logistic regression modelling, in line with generalized information criteria. We demonstrate through Monte Carlo simulations and a real-world example that the proposed robust modelling strategies perform well for sparse logistic regression modelling even in the presence of outliers.
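As a rough illustration of the weighted-likelihood idea, the sketch below iteratively downweights observations with large Pearson residuals inside an L1-penalized logistic fit. The Huber-type weight function, the cutoff `c0`, and the use of scikit-learn are illustrative assumptions, not the authors' exact estimator.

```python
# Sketch: robust L1-penalized logistic regression via iterative reweighting.
# The weighting scheme (downweighting large Pearson residuals) is an
# illustrative choice, not the paper's exact weighted-likelihood method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def robust_penalized_logistic(X, y, C=1.0, c0=2.5, n_iter=5):
    w = np.ones(len(y))
    for _ in range(n_iter):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(X, y, sample_weight=w)
        p = clf.predict_proba(X)[:, 1]
        r = (y - p) / np.sqrt(p * (1 - p) + 1e-12)          # Pearson residuals
        w = np.where(np.abs(r) <= c0, 1.0, c0 / np.abs(r))  # Huber-type weights
    return clf, w
```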

2.
The purpose of this article is to present a new method to predict the response variable of an observation in a new cluster for a multilevel logistic regression model. The central idea is based on the empirical best estimator for the random effect. Two estimation methods for the multilevel model are compared: penalized quasi-likelihood and Gauss–Hermite quadrature. Through simulations and an application, we examine performance measures for predicting the probability for a new-cluster observation under the multilevel logistic model, in comparison with the usual logistic model.
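One standard way to form such a prediction, sketched below, is to integrate the unknown random intercept out of the fitted model by Gauss–Hermite quadrature; `beta` and `sigma_u` are hypothetical estimates taken from a previously fitted multilevel logistic model, not outputs of this snippet.

```python
# Sketch: marginal predicted probability for an observation in a NEW cluster,
# integrating the random intercept out via Gauss-Hermite quadrature.
import numpy as np

def predict_new_cluster(x, beta, sigma_u, n_nodes=20):
    # nodes/weights for integrals against exp(-t^2)
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    # change of variable u = sqrt(2) * sigma_u * t for a N(0, sigma_u^2) intercept
    eta = x @ beta + np.sqrt(2.0) * sigma_u * t
    return np.sum(w / np.sqrt(np.pi) / (1.0 + np.exp(-eta)))
```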

3.
Summary. The classical approach to statistical analysis is usually based upon finding values for model parameters that maximize the likelihood function. Model choice in this context is often also based on the likelihood function, but with the addition of a penalty term for the number of parameters. Though models may be compared pairwise by using likelihood ratio tests for example, various criteria such as the Akaike information criterion have been proposed as alternatives when multiple models need to be compared. In practical terms, the classical approach to model selection usually involves maximizing the likelihood function associated with each competing model and then calculating the corresponding criteria value(s). However, when large numbers of models are possible, this quickly becomes infeasible unless a method that simultaneously maximizes over both parameter and model space is available. We propose an extension to the traditional simulated annealing algorithm that allows for moves that not only change parameter values but also move between competing models. This transdimensional simulated annealing algorithm can therefore be used to locate models and parameters that minimize criteria such as the Akaike information criterion, but within a single algorithm, removing the need for large numbers of simulations to be run. We discuss the implementation of the transdimensional simulated annealing algorithm and use simulation studies to examine its performance in realistically complex modelling situations. We illustrate our ideas with a pedagogic example based on the analysis of an autoregressive time series and two more detailed examples: one on variable selection for logistic regression and the other on model selection for the analysis of integrated recapture–recovery data.
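A toy version of the idea is sketched below: simulated annealing over variable subsets of a linear model scored by AIC. Unlike the paper's transdimensional algorithm, this simplification fits each visited model exactly by least squares rather than also annealing in parameter space.

```python
# Sketch: simulated annealing over variable subsets, minimizing AIC.
import numpy as np

def aic_ols(X, y, subset):
    Xs = X[:, subset] if subset else np.ones((len(y), 0))
    Xs = np.column_stack([np.ones(len(y)), Xs])          # intercept + subset
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    k = Xs.shape[1] + 1                                  # coefficients + error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

def tsa_select(X, y, n_steps=2000, T0=1.0, cool=0.995, rng=None):
    rng = rng or np.random.default_rng(0)
    p = X.shape[1]
    cur = [j for j in range(p) if rng.random() < 0.5]
    cur_aic, T = aic_ols(X, y, cur), T0
    best, best_aic = cur, cur_aic
    for _ in range(n_steps):
        j = int(rng.integers(p))                         # propose toggling one variable
        prop = sorted(set(cur) ^ {j})
        prop_aic = aic_ols(X, y, prop)
        if prop_aic < cur_aic or rng.random() < np.exp((cur_aic - prop_aic) / T):
            cur, cur_aic = prop, prop_aic
            if cur_aic < best_aic:
                best, best_aic = cur, cur_aic
        T *= cool                                        # geometric cooling schedule
    return best, best_aic
```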

4.
We propose a new criterion for model selection in prediction problems. The covariance inflation criterion adjusts the training error by the average covariance of the predictions and responses, when the prediction rule is applied to permuted versions of the data set. This criterion can be applied to general prediction problems (e.g. regression or classification) and to general prediction rules (e.g. stepwise regression, tree-based models and neural nets). As a by-product we obtain a measure of the effective number of parameters used by an adaptive procedure. We relate the covariance inflation criterion to other model selection procedures and illustrate its use in some regression and classification problems. We also revisit the conditional bootstrap approach to model selection.
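A hedged sketch of the permutation idea: refit the rule on permuted responses, average the resulting prediction–response covariances, and add the scaled average to the training error. The factor 2 and the squared-error setting follow the usual optimism adjustment and are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch: covariance-inflation-style adjustment of training error,
# estimated from permuted responses. Details differ from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

def cic(X, y, fit=lambda X, y: LinearRegression().fit(X, y), n_perm=50, rng=None):
    rng = rng or np.random.default_rng(0)
    train_err = np.mean((y - fit(X, y).predict(X)) ** 2)
    covs = []
    for _ in range(n_perm):
        y_star = rng.permutation(y)                   # permuted responses
        pred = fit(X, y_star).predict(X)              # rule refitted on permuted data
        covs.append(np.mean((pred - pred.mean()) * (y_star - y_star.mean())))
    return train_err + 2.0 * np.mean(covs)
```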

5.
The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on variable subset selection for regression and classification and perform several numerical experiments using both simulated and real-world data. The results show that optimizing a utility estimate such as the cross-validation (CV) score is liable to find overfitted models, owing to the relatively high variance of the utility estimates when data are scarce. This can also lead to substantial selection-induced bias and optimism in the performance evaluation of the selected model. From a predictive viewpoint, the best results are obtained by accounting for model uncertainty through a full encompassing model, such as the Bayesian model averaging solution over the candidate models. If the encompassing model is too complex, it can be robustly simplified by the projection method, in which the information in the full model is projected onto the submodels. This approach is substantially less prone to overfitting than selection based on the CV score. Overall, the projection method also appears to outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the search process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.
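The point about keeping cross-validation outside the search can be illustrated with nested CV. In this sketch the inner forward selector is a stand-in for any search procedure; all names come from scikit-learn and the data are synthetic.

```python
# Sketch: nested CV keeps the reported score free of selection-induced
# optimism, because selection is re-run inside every outer fold.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
inner = LogisticRegression(max_iter=1000)
pipe = make_pipeline(
    SequentialFeatureSelector(inner, n_features_to_select=5, cv=5),  # inner search
    inner)
outer_scores = cross_val_score(pipe, X, y, cv=5)  # outer, honest assessment
print(outer_scores.mean())
```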

6.
Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or on density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions about the form of the probability functions or the underlying distributions of the subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme works by solving a series of weighted large-margin classifiers and then systematically extracting the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied to a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate the competitive performance of the new approach and to compare it with several other existing methods.
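For the binary case, a minimal sketch of the weighted large-margin idea: a classifier trained with misclassification costs (π, 1−π) approximately predicts class 1 exactly when P(Y=1|x) > π, so sweeping π over a uniform grid and averaging the predicted labels recovers a probability estimate. The multiclass scheme in the article generalizes this; the RBF SVM, the grid, and the 0/1 label coding are illustrative assumptions.

```python
# Sketch: probability estimation from a series of weighted SVMs.
# Assumes y is coded {0, 1}.
import numpy as np
from sklearn.svm import SVC

def margin_probabilities(X, y, X_new, grid=np.linspace(0.05, 0.95, 19)):
    votes = np.zeros((len(X_new), len(grid)))
    for j, pi in enumerate(grid):
        # cost pi for misclassifying class 0, (1 - pi) for class 1:
        # the fitted rule approximates sign(P(Y=1|x) - pi)
        clf = SVC(kernel="rbf", class_weight={0: pi, 1: 1.0 - pi})
        votes[:, j] = clf.fit(X, y).predict(X_new)
    # fraction of the weight grid where the label flips to 1 ~ P(Y=1|x)
    return votes.mean(axis=1)
```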

7.
As no single classification method outperforms all others under all circumstances, decision-makers may solve a classification problem using several classification methods and examine their performance on the learning set. Based on this performance, better classification methods can be adopted and poor ones avoided. However, which single classification method best predicts the classification of new observations is still not clear, especially when several methods offer similar classification performance on the learning set. In this article we present various regression and classical methods that combine several classification methods to predict the classification of new observations. The quality of the combined classifiers is examined on some real data; among the combiners considered, nonparametric regression performed best.
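A minimal stacking-style sketch of such a combination, assuming scikit-learn: out-of-fold probabilities from several base classifiers become inputs to a nonparametric (k-NN) combiner. The particular base learners and the neighbourhood size are illustrative choices.

```python
# Sketch: combining classifiers by nonparametric regression of the
# class label on out-of-fold base-model probabilities.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC

def combine_classifiers(X, y, X_new):
    bases = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(n_estimators=200, random_state=0),
             SVC(probability=True, random_state=0)]
    # out-of-fold probabilities avoid optimistic stacking features
    Z = np.column_stack([cross_val_predict(m, X, y, cv=5,
                                           method="predict_proba")[:, 1]
                         for m in bases])
    combiner = KNeighborsRegressor(n_neighbors=25).fit(Z, y)
    Z_new = np.column_stack([m.fit(X, y).predict_proba(X_new)[:, 1]
                             for m in bases])
    return combiner.predict(Z_new)          # combined estimate of P(Y=1|x)
```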

8.
The problem of constructing classification methods based on both labeled and unlabeled data sets is considered for analyzing data with complex structures. We introduce a semi-supervised logistic discriminant model with Gaussian basis expansions. The unknown parameters of the logistic model are estimated by a regularization method via the EM algorithm. To select the tuning parameters, we derive a model selection criterion from a Bayesian viewpoint. Numerical studies are conducted to investigate the effectiveness of the proposed modeling procedures.
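A rough sketch of the EM idea, assuming scikit-learn: in the E-step the current model assigns soft labels to the unlabeled points, and in the M-step a penalized logistic model on the Gaussian (RBF) basis is refitted with those soft labels as case weights. The ridge penalty and the basis centers are illustrative stand-ins for the paper's regularization and basis construction.

```python
# Sketch: EM for semi-supervised logistic discrimination with an RBF basis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def semi_supervised_logistic(X_lab, y_lab, X_unlab, centers,
                             gamma=1.0, C=1.0, n_iter=20):
    Phi_l = rbf_kernel(X_lab, centers, gamma=gamma)    # Gaussian basis expansion
    Phi_u = rbf_kernel(X_unlab, centers, gamma=gamma)
    clf = LogisticRegression(C=C, max_iter=1000).fit(Phi_l, y_lab)
    for _ in range(n_iter):
        p = clf.predict_proba(Phi_u)[:, 1]             # E-step: soft labels
        # M-step: each unlabeled point enters twice, weighted by its posterior
        Phi = np.vstack([Phi_l, Phi_u, Phi_u])
        yy = np.concatenate([y_lab, np.ones(len(p)), np.zeros(len(p))])
        w = np.concatenate([np.ones(len(y_lab)), p, 1.0 - p])
        clf = LogisticRegression(C=C, max_iter=1000).fit(Phi, yy, sample_weight=w)
    return clf
```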

9.
The purpose of this paper is to develop a Bayesian approach for the Weibull-Negative-Binomial regression model with cure rate under latent failure causes and in the presence of randomized activation mechanisms. We assume the number of competing causes of the event of interest follows a Negative Binomial (NB) distribution, while the latent lifetimes are assumed to follow a Weibull distribution. Markov chain Monte Carlo (MCMC) methods are used to develop the Bayesian procedure. Model selection to compare the fitted models is discussed. Moreover, we develop case-deletion influence diagnostics for the joint posterior distribution based on the ψ-divergence, which has several divergence measures as particular cases. The developed procedures are illustrated with a real data set.
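For orientation, one common parameterization of the NB cure-rate structure (an assumption for illustration; the paper's exact formulation and activation mechanism may differ) gives the population survival and cure fraction as:

```latex
% Assumed common parameterization: M competing causes with NB mean \theta
% and dispersion \eta; latent Weibull lifetimes with c.d.f. F(t).
S_{\mathrm{pop}}(t) = \bigl[\, 1 + \eta\,\theta\, F(t) \,\bigr]^{-1/\eta},
\qquad
p_0 = \lim_{t\to\infty} S_{\mathrm{pop}}(t) = (1 + \eta\theta)^{-1/\eta},
\qquad
F(t) = 1 - \exp\!\bigl\{-(t/\lambda)^{\alpha}\bigr\}.
```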

10.
Studies of perceived risk measured on a continuous [0,100] scale were recently introduced in psychometrics; such scores can be transformed to the unit interval, but zeros and ones are commonly observed. Motivated by this, we introduce a full set of inferential tools that allows for augmented and limited data modeling. We consider parameter estimation, residual analysis, influence diagnostics and model selection for zero-and/or-one augmented beta rectangular (ZOABR) regression models and their particular nested models, based on a new parameterization of the beta rectangular distribution. Unlike other alternatives, we perform maximum-likelihood estimation using a combination of the EM algorithm (for the continuous part) and the Fisher scoring algorithm (for the discrete part). We also take an additional step by considering other link functions, besides the usual logistic link, for modeling the response mean. Using randomized quantile residuals, (local) influence diagnostics and model selection tools, we identified the ZOABR regression model as the best one. We also conducted extensive simulation studies, which indicate that all the developed tools work properly. Finally, we discuss the use of this type of model for treating psychometric data. It is worth mentioning that applications of the developed methods go beyond psychometric data: they can be useful whenever the response variable is bounded, whether or not it attains the respective limits.
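A sketch of the augmented-likelihood decomposition, with illustrative parameter names rather than the paper's notation: point masses at 0 and 1 plus a beta rectangular (uniform–beta mixture) density on (0,1).

```python
# Sketch: log-likelihood of a zero-and-one augmented beta rectangular model.
# p0, p1: point masses at 0 and 1; alpha: uniform mixing weight;
# mu, phi: mean/precision of the beta component. Illustrative names only.
import numpy as np
from scipy import stats

def zoabr_loglik(y, p0, p1, alpha, mu, phi):
    inner = y[(y > 0) & (y < 1)]
    # beta rectangular density: mixture of Uniform(0,1) and Beta(mu*phi, (1-mu)*phi)
    f_br = alpha + (1.0 - alpha) * stats.beta.pdf(inner, mu * phi, (1.0 - mu) * phi)
    return (np.sum(y == 0) * np.log(p0) + np.sum(y == 1) * np.log(p1)
            + len(inner) * np.log(1.0 - p0 - p1) + np.sum(np.log(f_br)))
```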

11.
Efficient statistical inference for nonignorable missing data is a challenging problem. This paper proposes a new estimation procedure based on composite quantile regression (CQR) for linear regression models with nonignorable missing data that is applicable even with high-dimensional covariates. A parametric model is assumed for the response probability, which is estimated by the empirical likelihood approach. Local identifiability of the proposed strategy is guaranteed on the basis of an instrumental variable approach. A set of data-based adaptive weights constructed via an empirical likelihood method is used to weight the CQR functions. The proposed method is resistant to heavy-tailed errors or outliers in the response. An adaptive penalisation method for variable selection is proposed to achieve sparsity with high-dimensional covariates. Limiting distributions of the proposed estimators are derived. Simulation studies are conducted to investigate the finite-sample performance of the proposed methodologies, and an application to the ACTG 175 data is presented.
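A minimal sketch of unweighted CQR on complete data: K quantile levels share one slope vector but have separate intercepts. The paper additionally weights these check losses with empirical-likelihood-based weights to handle nonignorable missingness; the derivative-free optimizer below is an illustrative choice suited only to low dimensions.

```python
# Sketch: composite quantile regression by direct minimization of the
# summed check losses, with K shared-slope quantile regressions.
import numpy as np
from scipy.optimize import minimize

def cqr_fit(X, y, K=9):
    taus = np.arange(1, K + 1) / (K + 1)
    n, p = X.shape
    def loss(theta):
        b, beta = theta[:K], theta[K:]          # K intercepts, one slope vector
        total = 0.0
        for k, tau in enumerate(taus):
            r = y - b[k] - X @ beta
            total += np.sum(r * (tau - (r < 0)))  # check (pinball) loss
        return total
    theta0 = np.concatenate([np.quantile(y, taus), np.zeros(p)])
    return minimize(loss, theta0, method="Nelder-Mead",
                    options={"maxiter": 20000, "xatol": 1e-6}).x
```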

12.
Wang Xiaoyan et al., Statistical Research (统计研究), 2014, 31(9): 107-112
Variable selection is a key step in statistical modelling: choosing suitable variables yields a robust model that is simple in structure and accurate in prediction. This paper proposes a new bi-level variable selection penalty for logistic regression, the adaptive Sparse Group Lasso (adSGL), whose distinctive feature is that it screens variables according to their group structure, achieving selection both within and between groups. The advantage of the method is that it penalizes individual coefficients and group coefficients to different degrees, avoiding over-penalization of large coefficients and thereby improving the estimation and prediction accuracy of the model. The main computational difficulty is that the penalized likelihood is not strictly convex; we therefore solve the model by group coordinate descent and establish a criterion for choosing the tuning parameters. Simulation studies show that, compared with the representative existing methods Sparse Group Lasso, Group Lasso and Lasso, adSGL not only improves bi-level selection accuracy but also reduces model error. Finally, we apply adSGL to credit-card credit scoring, where it achieves higher classification accuracy and robustness than logistic regression.
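For concreteness, a sketch of the proximal step behind an adaptive sparse group lasso penalty, as it would be used inside a proximal-gradient or group coordinate descent solver. The weight vectors `w` (per coefficient) and `v` (per group) are hypothetical adaptive weights built from an initial estimate; names are illustrative, not the paper's notation.

```python
# Sketch: proximal operator of lam1 * sum(w_j |b_j|) + lam2 * sum(v_g ||b_g||_2),
# i.e. elementwise soft-thresholding followed by group soft-thresholding.
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_adsgl(beta, groups, lam1, lam2, w, v, step=1.0):
    out = np.empty_like(beta)
    for g, idx in enumerate(groups):                     # groups: list of index arrays
        u = soft(beta[idx], step * lam1 * w[idx])        # within-group sparsity
        norm = np.linalg.norm(u)
        shrink = max(0.0, 1.0 - step * lam2 * v[g] / norm) if norm > 0 else 0.0
        out[idx] = shrink * u                            # group-level sparsity
    return out
```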

13.

Regression spline smoothing is a popular approach for conducting nonparametric regression. An important issue associated with it is the choice of a "theoretically best" set of knots. Different statistical model selection methods, such as Akaike's information criterion and generalized cross-validation, have been applied to derive different "theoretically best" sets of knots. Typically these best knot sets are defined implicitly as the optimizers of some objective functions. Hence another equally important issue concerning regression spline smoothing is how to optimize such objective functions. In this article different numerical algorithms that are designed for carrying out such optimization problems are compared by means of a simulation study. Both the univariate and bivariate smoothing settings will be considered. Based on the simulation results, recommendations for choosing a suitable optimization algorithm under various settings will be provided.
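A minimal sketch of one such objective and optimizer, assuming a truncated-power basis: GCV is computed from the hat matrix of a candidate knot set, and a naive random search stands in for the more refined algorithms compared in the article.

```python
# Sketch: GCV of a candidate knot set, plus a naive random search over knots.
import numpy as np

def gcv_knots(x, y, knots, degree=3):
    # truncated-power spline basis: 1, x, ..., x^d, (x - k)_+^d
    B = np.column_stack([x ** d for d in range(degree + 1)] +
                        [np.maximum(x - k, 0.0) ** degree for k in knots])
    H = B @ np.linalg.pinv(B)                 # hat matrix of the least-squares fit
    resid = y - H @ y
    n, df = len(y), np.trace(H)
    return n * np.sum(resid ** 2) / (n - df) ** 2

def random_search_knots(x, y, n_knots=5, n_iter=500, rng=None):
    rng = rng or np.random.default_rng(0)
    best, best_gcv = None, np.inf
    for _ in range(n_iter):
        knots = np.sort(rng.uniform(x.min(), x.max(), n_knots))
        g = gcv_knots(x, y, knots)
        if g < best_gcv:
            best, best_gcv = knots, g
    return best, best_gcv
```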

14.
Case-base sampling provides an alternative to risk set sampling based methods to estimate hazard regression models, in particular when absolute hazards are also of interest in addition to hazard ratios. The case-base sampling approach results in a likelihood expression of the logistic regression form, but instead of categorized time, such an expression is obtained through sampling of a discrete set of person-time coordinates from all follow-up data. In this paper, in the context of a time-dependent exposure such as vaccination, and a potentially recurrent adverse event outcome, we show that the resulting partial likelihood for the outcome event intensity has the asymptotic properties of a likelihood. We contrast this approach to self-matched case-base sampling, which involves only within-individual comparisons. The efficiency of the case-base methods is compared to that of standard methods through simulations, suggesting that the information loss due to sampling is minimal.
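A hedged sketch of the sampling scheme for the simpler time-fixed-covariate case: person-moments drawn uniformly over follow-up form the base series, event times form the case series, and a logistic model with offset log(B/b) is fitted, with log-time in the linear predictor giving a Weibull-type hazard. The sampling design and the choice of time function are illustrative simplifications of the paper's setting.

```python
# Sketch: case-base sampling with time-fixed covariates, fitted as a
# logistic GLM with an offset. B = total person-time, b = base-series size.
import numpy as np
import statsmodels.api as sm

def case_base_fit(time, event, X, ratio=10, rng=None):
    rng = rng or np.random.default_rng(0)
    B, b = time.sum(), ratio * int(event.sum())
    idx = rng.choice(len(time), size=b, p=time / B)     # person sampled propto follow-up
    base_X, base_t = X[idx], rng.uniform(0, time[idx])  # uniform moment within follow-up
    case_X, case_t = X[event == 1], time[event == 1]
    XX = np.vstack([case_X, base_X])
    tt = np.concatenate([case_t, base_t])
    yy = np.concatenate([np.ones(len(case_t)), np.zeros(b)])
    design = sm.add_constant(np.column_stack([XX, np.log(tt)]))  # log-time term
    offset = np.full(len(yy), np.log(B / b))
    return sm.GLM(yy, design, family=sm.families.Binomial(), offset=offset).fit()
```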

15.
We used two statistical methods to identify prognostic factors: a log-linear model (logistic and Cox regression, based on the notions of linearity and multiplicative relative risk), and the CORICO method (ICOnography of CORrelations), based on the geometric significance of the correlation coefficient. We applied the methods to two different situations (a 'case-control study' and a 'historical cohort'). We show that the geometric exploratory tool is particularly suited to the analysis of small samples with a large number of variables, and it could save time when setting up new study protocols. In this instance, the geometric approach highlighted, without preconceived ideas, the potential role of multihormonality in the course of pituitary adenoma and the unexpected influence of the date of tumour excision on the risk attached to haemorrhage.

16.
Summary. We develop a new class of continuous-time autoregressive fractionally integrated moving average (CARFIMA) models which are useful for modelling regularly spaced and irregularly spaced discrete-time long-memory data. We derive the autocovariance function of a stationary CARFIMA model and study maximum likelihood estimation of a regression model with CARFIMA errors, based on discrete-time data and via the innovations algorithm. It is shown that the maximum likelihood estimator is asymptotically normal, and its finite-sample properties are studied through simulation. The efficacy of the proposed approach is demonstrated with a data set from an environmental study.

17.
In the problem of selecting variables in a multivariate linear regression model, we derive new Bayesian information criteria based on a prior mixing a smooth distribution and a delta distribution. Each of them can be interpreted as a fusion of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Inheriting their asymptotic properties, our information criteria are consistent in variable selection in both the large-sample and the high-dimensional asymptotic frameworks. In numerical simulations, variable selection methods based on our information criteria choose the true set of variables with high probability in most cases.

18.
Qunfang Xu, Statistics, 2017, 51(6): 1280-1303
In this paper, semiparametric modelling for longitudinal data with an unstructured error process is considered. We propose a partially linear additive regression model for longitudinal data in which within-subject variances and covariances of the error process are described by unknown univariate and bivariate functions, respectively. We provide an estimating approach in which polynomial splines are used to approximate the additive nonparametric components and the within-subject variance and covariance functions are estimated nonparametrically. Both the asymptotic normality of the resulting parametric component estimators and optimal convergence rate of the resulting nonparametric component estimators are established. In addition, we develop a variable selection procedure to identify significant parametric and nonparametric components simultaneously. We show that the proposed SCAD penalty-based estimators of non-zero components have an oracle property. Some simulation studies are conducted to examine the finite-sample performance of the proposed estimation and variable selection procedures. A real data set is also analysed to demonstrate the usefulness of the proposed method.
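For reference, a sketch of the SCAD penalty in Fan and Li's standard form with the conventional choice a = 3.7; the paper applies this penalty to both parametric and spline-coefficient components.

```python
# Sketch: the SCAD penalty p_lambda(|beta|), evaluated elementwise.
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    b = np.abs(beta)
    p1 = lam * b                                                # |b| <= lam
    p2 = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))  # lam < |b| <= a*lam
    p3 = lam ** 2 * (a + 1) / 2                                 # |b| > a*lam
    return np.where(b <= lam, p1, np.where(b <= a * lam, p2, p3))
```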

19.
Variable and model selection problems are fundamental to high-dimensional statistical modeling in diverse fields of science. In health studies especially, many potential factors are usually introduced to determine an outcome variable. This paper deals with the problem of high-dimensional statistical modeling through the analysis of the trauma annual data in Greece for 2005. The data set is divided into an experiment set and a control set and consists of 6334 observations and 112 factors, including demographic, transport and intrahospital data, used to detect possible risk factors of death. In our study, different model selection techniques are applied to the experiment set, and the notion of deviance is used on the control set to assess the fit of the overall selected model. The statistical methods employed in this work were the non-concave penalized likelihood methods (smoothly clipped absolute deviation, the least absolute shrinkage and selection operator, and hard thresholding), generalized linear logistic regression, and best-subset variable selection. We discuss how to identify the significant variables in large medical data sets, along with the performance, pros and cons of the various statistical techniques used. The performed analysis reveals the distinct advantages of the non-concave penalized likelihood methods over the traditional model selection techniques.
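A minimal sketch of the experiment/control workflow described above, assuming scikit-learn and hypothetical array names: fit a penalized logistic model on the experiment set, then assess it by binomial deviance on the held-out control set.

```python
# Sketch: penalized logistic fit on the experiment set, binomial
# deviance computed on the control set. Array names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def holdout_deviance(X_exp, y_exp, X_ctl, y_ctl, C=0.1):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X_exp, y_exp)
    p = np.clip(clf.predict_proba(X_ctl)[:, 1], 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y_ctl * np.log(p) + (1 - y_ctl) * np.log(1 - p))
```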

20.
Summary. Enormous quantities of geoelectrical data are produced daily and often used for large scale reservoir modelling. To interpret these data requires reliable and efficient inversion methods which adequately incorporate prior information and use realistically complex modelling structures. We use models based on random coloured polygonal graphs as a powerful and flexible modelling framework for the layered composition of the Earth and we contrast our approach with earlier methods based on smooth Gaussian fields. We demonstrate how the reconstruction algorithm may be efficiently implemented through the use of multigrid Metropolis–coupled Markov chain Monte Carlo methods and illustrate the method on a set of field data.
