Variable and model selection problems are fundamental to high-dimensional statistical modeling in diverse fields of science. Especially in health studies, many potential factors are usually introduced to determine an outcome variable. This paper deals with the problem of high-dimensional statistical modeling through the analysis of the annual trauma data for Greece in 2005. The data set is divided into experiment and control sets and consists of 6334 observations and 112 factors, including demographic, transport and intrahospital data, used to detect possible risk factors of death. In our study, different model selection techniques are applied to the experiment set, and the notion of deviance is used on the control set to assess the fit of the overall selected model. The statistical methods employed in this work were the non-concave penalized likelihood methods (smoothly clipped absolute deviation (SCAD), least absolute shrinkage and selection operator (LASSO), and the hard thresholding penalty), generalized linear logistic regression, and best subset variable selection. The way of identifying the significant variables in large medical data sets is discussed, along with the performance and the pros and cons of the various statistical techniques used. The analysis reveals the distinct advantages of the non-concave penalized likelihood methods over the traditional model selection techniques.
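The three penalties named in the abstract above can be compared through their thresholding rules. The following is a minimal sketch, assuming the textbook special case of penalized least squares with an orthonormal design, where each coefficient is obtained by thresholding its ordinary least squares estimate `z`. This is not the paper's penalized logistic regression fit; the function names and the default `a = 3.7` (the value commonly suggested for SCAD) are illustrative assumptions.

```python
def soft_threshold(z, lam):
    """LASSO rule: shrink z toward zero by lam; values with |z| <= lam become zero."""
    sign = 1.0 if z > 0 else -1.0
    return sign * max(abs(z) - lam, 0.0)


def hard_threshold(z, lam):
    """Hard rule: keep z unchanged if |z| > lam, otherwise set it to zero."""
    return z if abs(z) > lam else 0.0


def scad_threshold(z, lam, a=3.7):
    """SCAD rule (requires a > 2): soft thresholding for small |z|,
    a linear transition for moderate |z|, and no shrinkage for large |z|."""
    sign = 1.0 if z > 0 else -1.0
    if abs(z) <= 2 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    return z


if __name__ == "__main__":
    for z in (0.5, 1.5, 3.0, 5.0):
        print(z, soft_threshold(z, 1.0), hard_threshold(z, 1.0), scad_threshold(z, 1.0))
```

Under these assumptions, large estimates are left unshrunk by the hard and SCAD rules but are always shrunk by `lam` under the LASSO rule, which is one commonly cited motivation for non-concave penalties.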

Prostate cancer (PrCA) is the most common cancer diagnosed in American men and the second leading cause of death from malignancies. There are large geographical variations and racial disparities in the survival rate of PrCA. Much work on the spatial survival model is based on the proportional hazards (PH) model, but few studies have focused on the accelerated failure time (AFT) model. In this paper, we investigate the PrCA data of Louisiana from the Surveillance, Epidemiology, and End Results program. The violation of the PH assumption suggests that the spatial survival model based on the AFT model is more appropriate for this data set. To account for possible extra-variation, we consider spatially referenced independent or dependent spatial structures. The deviance information criterion is used to select the best-fitting model within the Bayesian framework. The results from our study indicate that age, race, stage, and geographical distribution are significant in evaluating PrCA survival.

A new method for detecting parameter changes in the generalized autoregressive conditional heteroskedasticity GARCH(1,1) model is proposed. In the proposed method, time series observations are divided into several segments and a GARCH(1,1) model is fitted to each segment. The goodness-of-fit of the global model composed of these local GARCH(1,1) models is evaluated using the corresponding information criterion (IC). The division that minimizes the IC defines the best model. Furthermore, since the simultaneous estimation of all possible models requires huge computational time, a new time-saving algorithm is proposed. Simulation results and empirical results both indicate that the proposed method is useful in analysing financial data.
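The idea of scoring a division of the series by the IC of the global model built from local models can be illustrated with a simplified sketch. The assumptions here are substantial: each segment is modelled as independent Gaussian noise with its own mean and variance (not a GARCH(1,1) model), the IC is taken to be AIC, and only a single break point is searched by brute force. The abstract's time-saving algorithm is not described there and is not reproduced. All function names are hypothetical.

```python
import math


def gaussian_neg2_loglik(x):
    """-2 times the maximized Gaussian log-likelihood of one segment."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    var = max(var, 1e-12)  # guard against a zero-variance segment
    return n * (math.log(2 * math.pi * var) + 1.0)


def aic_of_division(x, breaks):
    """AIC of the global model: one local Gaussian model per segment."""
    edges = [0] + list(breaks) + [len(x)]
    neg2ll = sum(gaussian_neg2_loglik(x[a:b]) for a, b in zip(edges, edges[1:]))
    n_params = 2 * (len(edges) - 1)  # a mean and a variance per segment
    return neg2ll + 2 * n_params


def best_single_break(x, min_len=5):
    """Compare 'no break' with every admissible single break; return the AIC minimizer."""
    best = ((), aic_of_division(x, ()))
    for t in range(min_len, len(x) - min_len + 1):
        cand = aic_of_division(x, (t,))
        if cand < best[1]:
            best = ((t,), cand)
    return best


if __name__ == "__main__":
    # Toy series whose volatility jumps at index 20.
    series = [0.1, -0.1] * 10 + [5.0, -5.0] * 10
    print(best_single_break(series))
```

An exhaustive search over several break points grows combinatorially with the number of segments, which is the computational burden the abstract refers to.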

Variational Bayes (VB) estimation is a fast alternative to Markov Chain Monte Carlo for performing approximate Bayesian inference. This procedure can be an efficient and effective means of analyzing large datasets. However, VB estimation is often criticised, typically on empirical grounds, for being unable to produce valid statistical inferences. In this article we refute this criticism for one of the simplest models where Bayesian inference is not analytically tractable, that is, the Bayesian linear model (for a particular choice of priors). We prove that under mild regularity conditions, VB-based estimators enjoy some desirable frequentist properties such as consistency and can be used to obtain asymptotically valid standard errors. In addition to these results we introduce two VB information criteria: the variational Akaike information criterion and the variational Bayesian information criterion. We show that the variational Akaike information criterion is asymptotically equivalent to the frequentist Akaike information criterion and that the variational Bayesian information criterion is first-order equivalent to the Bayesian information criterion in linear regression. These results motivate the potential use of the variational information criteria for more complex models. We support our theoretical results with numerical examples.
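For reference, the classical criteria that the variational versions are compared against can be written in a few lines. This sketch covers only the standard frequentist AIC and BIC for Gaussian linear regression, with the maximized log-likelihood expressed through the residual sum of squares; the variational criteria themselves are not defined in the abstract and are not reproduced here. The function names and the convention of counting the error variance as a parameter are assumptions.

```python
import math


def gaussian_loglik_from_rss(rss, n):
    """Maximized Gaussian log-likelihood of a linear model with residual sum of squares rss."""
    sigma2_hat = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1.0)


def aic(loglik, k):
    """Akaike information criterion: -2 log L + 2 k."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * loglik + k * math.log(n)


if __name__ == "__main__":
    n = 100
    # Hypothetical fits: a 3-coefficient model and a 6-coefficient model (+1 for the variance).
    for rss, p in ((52.0, 3), (50.0, 6)):
        ll = gaussian_loglik_from_rss(rss, n)
        print(p, round(aic(ll, p + 1), 3), round(bic(ll, p + 1, n), 3))
```

Both criteria share the same fit term and differ only in the penalty, so BIC penalizes each extra parameter more heavily than AIC whenever log n exceeds 2, that is, for n of 8 or more.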

This article focuses on the clustering problem based on Dirichlet process (DP) mixtures. Unlike existing clustering methods, the proposed semi-parametric model captures both time-invariant and temporal patterns, and is flexible in that common and unique patterns are taken into account simultaneously. Furthermore, by jointly clustering subjects and the associated variables, the intrinsic complex shared patterns among subjects and among variables are expected to be captured. The number of clusters and the cluster assignments are inferred directly with the use of the DP. Simulation studies illustrate the effectiveness of the proposed method. An application to wheal size data is discussed with the aim of identifying novel temporal patterns among allergens within subject clusters.

This paper addresses the problem of identifying groups that satisfy the specific conditions for the means of feature variables. In this study, we refer to the identified groups as “target clusters” (TCs). To identify TCs, we propose a method based on the normal mixture model (NMM) restricted by a linear combination of means. We provide an expectation–maximization (EM) algorithm to fit the restricted NMM by using the maximum-likelihood method. The convergence property of the EM algorithm and a reasonable set of initial estimates are presented. We demonstrate the method's usefulness and validity through a simulation study and two well-known data sets. The proposed method provides several types of useful clusters, which would be difficult to achieve with conventional clustering or exploratory data analysis methods based on the ordinary NMM. A simple comparison with another target clustering approach shows that the proposed method is promising in the identification.

Identifying differential gene expression in chickpea (Cicer arietinum) plant tissue is needed in order to develop, through the insertion of genes, new varieties of chickpea that are resistant to disease. This plant is the third legume plant of the Leguminosae (Fabaceae) family and is in high demand worldwide due to its high-protein seeds and its roots, which contain symbiotic nitrogen-fixing bacteria. This paper demonstrates the use of the Bayesian mixture model averaging (BMMA) approach to identify differential gene expression in chickpea plant tissue in Indonesia. The results show that the best BMMA normal models contain from 727 (73%) up to 939 (94%) of the 1,000 generated normal mixture models. On average, the BMMA models fitted to the gene expression difference data give 0.2878511 for the Kolmogorov–Smirnov (KS) statistic and 0.1278080 for the continuous ranked probability score (CRPS). Based on these BMMA models, there are three groups of gene IDs: downregulated, regulated, and upregulated. The results of this grouping can be useful for finding new varieties of chickpea plants that are more resistant to disease. The BMMA normal models, coupled with Occam's window as a data-driven modeling approach, succeeded in modeling the gene expression difference data from microarray experiments.

A usual approach for selecting the best subset AR model of known maximal order is to use an appropriate information criterion, such as AIC or SIC, with an exhaustive selection of regressors and to choose the subset model that produces the optimum (minimum) value of AIC or SIC. This method is computationally intensive. We propose a method that uses singular value decomposition and QR factorization with column pivoting to extract a reduced subset from the exhaustive candidate set of regressors, and then applies AIC or SIC to the reduced subset to obtain the best subset AR model. The result is a substantially reduced domain of exhaustive search for the computation of the best subset AR model.

We investigate the exact coverage and expected length properties of the model averaged tail area (MATA) confidence interval proposed by Turek and Fletcher, CSDA, 2012, in the context of two nested, normal linear regression models. The simpler model is obtained by applying a single linear constraint on the regression parameter vector of the full model. For given length of response vector and nominal coverage of the MATA confidence interval, we consider all possible models of this type and all possible true parameter values, together with a wide class of design matrices and parameters of interest. Our results show that, while not ideal, MATA confidence intervals perform surprisingly well in our regression scenario, provided that we use the minimum weight within the class of weights that we consider on the simpler model.

This paper presents an extension of mean-squared forecast error (MSFE) model averaging for integrating linear regression models computed on data frames of various lengths. The proposed method is considered to be a preferable alternative to best model selection by various efficiency criteria such as the Bayesian information criterion (BIC), Akaike information criterion (AIC), F-statistics and mean-squared error (MSE), as well as to Bayesian model averaging (BMA) and the naïve simple forecast average. The method is developed to deal with possibly non-nested models having different numbers of observations and selects forecast weights by minimizing the unbiased estimator of MSFE. The proposed method also yields forecast confidence intervals with a given significance level, which is not possible when applying other model averaging methods. In addition, out-of-sample simulation and empirical testing demonstrate the efficiency of this kind of averaging when forecasting economic processes.

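Feature selection for high-dimensional sparse data is the key to cluster analysis of Internet public-opinion texts. Borrowing the idea of penalized models, we use a penalized multinomial mixture model that performs feature selection by imposing heavier penalties on features that do not significantly affect the clustering result. The approach effectively selects typical words that represent the various viewpoints in public opinion, and it performs well in an empirical application.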

The variational approach to Bayesian inference enables simultaneous estimation of model parameters and model complexity. An interesting feature of this approach is that it also leads to an automatic choice of model complexity. Empirical results from the analysis of hidden Markov models with Gaussian observation densities illustrate this. If the variational algorithm is initialized with a large number of hidden states, redundant states are eliminated as the method converges to a solution, thereby leading to a selection of the number of hidden states. In addition, through the use of a variational approximation, the deviance information criterion for Bayesian model selection can be extended to the hidden Markov model framework. Calculation of the deviance information criterion provides a further tool for model selection, which can be used in conjunction with the variational approach.

We study model selection and model averaging in semiparametric partially linear models with missing responses. An imputation method is used to estimate the linear regression coefficients and the nonparametric function. We show that the corresponding estimators of the linear regression coefficients are asymptotically normal. Then a focused information criterion and frequentist model average estimators are proposed and their theoretical properties are established. Simulation studies are performed to demonstrate the superiority of the proposed methods over the existing strategies in terms of mean squared error and coverage probability. Finally, the approach is applied to a real data case.

Panel count data arise in many fields and a number of estimation procedures have been developed along with two procedures for variable selection. In this paper, we discuss model selection and parameter estimation together. For the former, a focused information criterion (FIC) is presented and for the latter, a frequentist model average (FMA) estimation procedure is developed. A main advantage of the FIC, and its difference from existing model selection methods, is that it emphasizes the accuracy of the estimation of the parameters of interest, rather than all parameters. Further efficiency gain can be achieved by the FMA estimation procedure because, unlike existing methods, it takes into account the variability in the model selection stage. Asymptotic properties of the proposed estimators are established, and a simulation study suggests that the proposed methods work well in practical situations. An illustrative example is also provided. © 2014 Board of the Foundation of the Scandinavian Journal of Statistics

We address the problem of robust model selection for finite memory stochastic processes. Consider m independent samples, with most of them being realizations of the same stochastic process with law Q, which is the one we want to retrieve. We define the asymptotic breakdown point γ for a model selection procedure and devise such a procedure. We compute the value of γ, which is 0.5 when all the processes are Markovian. This result is valid for any family of finite order Markov models, but for simplicity we focus on the family of variable length Markov chains.

Important progress has been made with model averaging methods over the past decades. For spatial data, however, the idea of model averaging has not been applied well. This article studies model averaging methods for the spatial geostatistical linear model. A spatial Mallows criterion is developed to choose weights for the model averaging estimator. The resulting estimator can achieve asymptotic optimality in terms of L2 loss. Simulation experiments reveal that our proposed estimator is superior to the model averaging estimator by the Mallows criterion developed for ordinary linear models [Hansen, 2007] and the model selection estimator using the corrected Akaike's information criterion, developed for geostatistical linear models [Hoeting et al., 2006]. The Canadian Journal of Statistics 47: 336–351; 2019 © 2019 Statistical Society of Canada

The analysis of word frequency count data can be very useful in authorship attribution problems. Zero-truncated generalized inverse Gaussian–Poisson mixture models are very helpful in the analysis of these kinds of data because their model-mixing density estimates can be used as estimates of the density of the word frequencies of the vocabulary. It is found that this model provides excellent fits for the word frequency counts of very long texts, where the truncated inverse Gaussian–Poisson special case fails because it does not allow for the large degree of over-dispersion in the data. The role played by the three parameters of this truncated GIG-Poisson model is also explored. Our second goal is to compare the fit of the truncated GIG-Poisson mixture model with the fit of the model that results from switching the order of the mixing and truncation stages. A heuristic interpretation of the mixing distribution estimates obtained under this alternative GIG-truncated Poisson mixture model is also provided.
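The truncation stage mentioned in the abstract above can be illustrated on its own. This is a minimal sketch of a plain zero-truncated Poisson distribution, assuming a fixed rate `lam`; the GIG mixing stage of the abstract's model is deliberately left out, and the function names are hypothetical. Word frequency counts are zero-truncated because a word that never occurs in a text is not observed at all.

```python
import math


def zero_truncated_poisson_pmf(k, lam):
    """P(X = k | X >= 1) for X ~ Poisson(lam); defined for integers k >= 1."""
    if k < 1:
        return 0.0
    # Work on the log scale to avoid overflow in lam**k and k! for large k.
    log_poisson = k * math.log(lam) - lam - math.lgamma(k + 1)
    return math.exp(log_poisson) / (1.0 - math.exp(-lam))


def zero_truncated_poisson_mean(lam):
    """Mean of the zero-truncated Poisson: lam / (1 - exp(-lam))."""
    return lam / (1.0 - math.exp(-lam))


if __name__ == "__main__":
    lam = 2.0
    probs = [zero_truncated_poisson_pmf(k, lam) for k in range(1, 60)]
    print(sum(probs))  # close to 1
    print(zero_truncated_poisson_mean(lam))
```

In a mixture model of the kind the abstract describes, `lam` would itself be random, and whether the truncation is applied after or before mixing is the ordering question the abstract raises.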

Focusing on the model selection problems in the family of Poisson mixture models (including the Poisson mixture regression model with random effects and zero‐inflated Poisson regression model with random effects), the current paper derives two conditional Akaike information criteria. The criteria are the unbiased estimators of the conditional Akaike information based on the conditional log‐likelihood and the conditional Akaike information based on the joint log‐likelihood, respectively. The derivation is free from the specific parametric assumptions about the conditional mean of the true data‐generating model and applies to different types of estimation methods. Additionally, the derivation is not based on the asymptotic argument. Simulations show that the proposed criteria have promising estimation accuracy. In addition, it is found that the criterion based on the conditional log‐likelihood demonstrates good model selection performance under different scenarios. Two sets of real data are used to illustrate the proposed method.

To measure the distance between a robust function evaluated under the true regression model and under a fitted model, we propose generalized Kullback–Leibler information. Using this generalization we have developed three robust model selection criteria, AICR*, AICCR* and AICCR, that allow the selection of candidate models that not only fit the majority of the data but also take into account non-normally distributed errors. The AICR* and AICCR criteria can unify most existing Akaike information criteria; three examples of such unification are given. Simulation studies are presented to illustrate the relative performance of each criterion.

M-estimation is a widely used technique for robust statistical inference. In this paper, we study model selection and model averaging for M-estimation to simultaneously improve the coverage probability of confidence intervals of the parameters of interest and reduce the impact of heavy-tailed errors or outliers in the response. Under general conditions, we develop robust versions of the focused information criterion and a frequentist model average estimator for M-estimation, and we examine their theoretical properties. In addition, we carry out extensive simulation studies and analyze two real examples to assess the performance of our new procedure, and find that the proposed method produces satisfactory results.
