Similar literature
1.
Missing data arise frequently in clinical and epidemiological fields, in particular in longitudinal studies. This paper describes the core features of an R package wgeesel, which implements marginal model fitting (i.e., weighted generalized estimating equations, WGEE; doubly robust GEE) for longitudinal data with dropouts under the assumption of missing at random. More importantly, this package comprehensively provides existing information criteria for WGEE model selection on marginal mean or correlation structures. Also, it can serve as a valuable tool for simulating longitudinal data with missing outcomes. Lastly, a real data example and simulations are presented to illustrate and validate our package.

2.
In this paper, we focus on feature extraction and variable selection for massive data that are divided and stored across different linked computers. Specifically, we study distributed model selection with the Smoothly Clipped Absolute Deviation (SCAD) penalty. Based on the Alternating Direction Method of Multipliers (ADMM) algorithm, we propose a distributed SCAD algorithm and prove its convergence. The variable selection results of the distributed approach are the same as those of the non-distributed approach. Numerical studies show that our method is both effective and efficient, and that it performs well in distributed data analysis.
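As a small illustration of the SCAD penalty named in this abstract, the sketch below evaluates the penalty for a single coefficient. It assumes the commonly used piecewise form with shape parameter a = 3.7 as the default; the function name `scad_penalty` and its arguments are hypothetical, and the sketch does not reproduce the paper's distributed ADMM algorithm.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """Evaluate the SCAD penalty for one coefficient.

    Assumed piecewise form (illustrative, not taken from the abstract):
      lam * |theta|                                      if |theta| <= lam
      (2*a*lam*|theta| - theta^2 - lam^2) / (2*(a - 1))  if lam < |theta| <= a*lam
      lam^2 * (a + 1) / 2                                if |theta| > a*lam
    """
    if lam <= 0 or a <= 2:
        raise ValueError("requires lam > 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        # Small coefficients are penalized linearly, as with the lasso.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region: the penalty grows more slowly.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients receive a constant penalty.
    return lam * lam * (a + 1) / 2
```

Under this form the penalty is continuous at both breakpoints and flat beyond a*lam, so large coefficients are not shrunk further.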

3.
Model selection methods are important to identify the best approximating model. To identify the best meaningful model, the purpose of the model should be clearly pre-stated. The focus of this paper is model selection when the modelling purpose is classification. We propose a new model selection approach designed for logistic regression model selection where the main modelling purpose is classification. The method is based on the distance between the two clustering trees. We also question and evaluate the performance of conventional model selection methods based on information theory concepts in determining the best logistic regression classifier. An extensive simulation study is used to assess the finite sample performance of the cluster tree based and the information theoretic model selection methods. Simulations are adjusted for whether the true model is in the candidate set or not. Results show that the new approach is highly promising. Finally, the methods are applied to a real data set to select a binary model as a means of classifying the subjects with respect to their risk of breast cancer.

4.
Many model-free dimension reduction methods have been developed for high-dimensional regression data, but they have not paid much attention to problems with non-linear confounding. In this paper, we propose an inverse-regression method of dependent variable transformation for detecting the presence of non-linear confounding. The benefit of using geometrical information from our method is highlighted. A ratio estimation strategy is incorporated in our approach to enhance the interpretation of variable selection. This approach can be implemented not only in principal Hessian directions (PHD) but also in other recently developed dimension reduction methods. Several simulation examples are reported for illustration, and comparisons are made with sliced inverse regression and PHD applied without accounting for non-linear confounding. An illustrative application to a real data set is also presented.

5.
A Bayesian approach is developed for analysing item response models with nonignorable missing data. The relevant model for the observed data is estimated jointly with the item response model for the missing-data process. Since the approach is fully Bayesian, it can be easily generalized to more complicated and realistic models, such as models with covariates. Furthermore, the proposed approach is illustrated with item response data modelled with multidimensional graded response models. Finally, a simulation study is conducted to assess the extent to which the bias caused by ignoring the missing-data mechanism can be reduced.

6.
Lasso and other regularization procedures are attractive methods for variable selection, subject to a proper choice of shrinkage parameter. Given a set of potential subsets produced by a regularization algorithm, a consistent model selection criterion is proposed to select the best one among this preselected set. The approach leads to a fast and efficient procedure for variable selection, especially in high-dimensional settings. Model selection consistency of the suggested criterion is proven when the number of covariates d is fixed. Simulation studies suggest that the criterion still enjoys model selection consistency when d is much larger than the sample size. The simulations also show that our approach for variable selection works surprisingly well in comparison with existing competitors. The method is also applied to a real data set.

7.
Evolutionary ecology is the study of evolutionary processes, and the ecological conditions that influence them. A fundamental paradigm underlying the study of evolution is natural selection. Although there are a variety of operational definitions for natural selection in the literature, perhaps the most general one is that which characterizes selection as the process whereby heritable variation in fitness associated with variation in one or more phenotypic traits leads to intergenerational change in the frequency distribution of those traits. The past 20 years have witnessed a marked increase in the precision and reliability of our ability to estimate one or more components of fitness and characterize natural selection in wild populations, owing particularly to significant advances in methods for analysis of data from marked individuals. In this paper, we focus on several issues that we believe are important considerations for the application and development of these methods in the context of addressing questions in evolutionary ecology. First, our traditional approach to estimation often rests upon analysis of aggregates of individuals, which in the wild may reflect increasingly non-random (selected) samples with respect to the trait(s) of interest. In some cases, analysis at the aggregate level, rather than the individual level, may obscure important patterns. While there are a growing number of analytical tools available to estimate parameters at the individual level, and which can cope (to varying degrees) with progressive selection of the sample, the advent of new methods does not reduce the need to consider carefully the appropriate level of analysis in the first place. Estimation should be motivated a priori by strong theoretical analysis. 
Doing so provides clear guidance, in terms of both (i) assisting in the identification of realistic and meaningful models to include in the candidate model set, and (ii) providing the appropriate context under which the results are interpreted. Second, while it is true that selection (as defined) operates at the level of the individual, the selection gradient is often (if not generally) conditional on the abundance of the population. As such, it may be important to consider estimating transition rates conditional on both the parameter values of the other individuals in the population (or at least their distribution), and population abundance. This will undoubtedly pose a considerable challenge, for both single- and multi-strata applications. It will also require renewed consideration of the estimation of abundance, especially for open populations. Thirdly, selection typically operates on dynamic, individually varying traits. Such estimation may require characterizing fitness in terms of individual plasticity in one or more state variables, constituting analysis of the norms of reaction of individuals to variable environments. This can be quite complex, especially for traits that are under facultative control. Recent work has indicated that the pattern of selection on such traits is conditional on the relative rates of movement among and frequency of spatially heterogeneous habitats, suggesting analyses of evolution of life histories in open populations can be misleading in some cases.

9.
The goal of this paper is to compare several widely used Bayesian model selection methods in practical model selection problems, highlight their differences and give recommendations about the preferred approaches. We focus on variable subset selection for regression and classification and perform several numerical experiments using both simulated and real world data. The results show that the optimization of a utility estimate such as the cross-validation (CV) score is liable to find overfitted models due to relatively high variance in the utility estimates when the data is scarce. This can also lead to substantial selection-induced bias and optimism in the performance evaluation for the selected model. From a predictive viewpoint, the best results are obtained by accounting for model uncertainty by forming the full encompassing model, such as the Bayesian model averaging solution over the candidate models. If the encompassing model is too complex, it can be robustly simplified by the projection method, in which the information of the full model is projected onto the submodels. This approach is substantially less prone to overfitting than selection based on the CV score. Overall, the projection method also appears to outperform the maximum a posteriori model and the selection of the most probable variables. The study also demonstrates that model selection can greatly benefit from using cross-validation outside the searching process, both for guiding the model size selection and for assessing the predictive performance of the finally selected model.

10.
In data sets with many predictors, algorithms for identifying a good subset of predictors are often used. Most such algorithms do not allow for any relationships between predictors. For example, stepwise regression might select a model containing an interaction AB but neither main effect A nor B. This paper develops mathematical representations of this and other relations between predictors, which may then be incorporated in a model selection procedure. A Bayesian approach that goes beyond the standard independence prior for variable selection is adopted, and preference for certain models is interpreted as prior information. Priors relevant to arbitrary interactions and polynomials, dummy variables for categorical factors, competing predictors, and restrictions on the size of the models are developed. Since the relations developed are for priors, they may be incorporated in any Bayesian variable selection algorithm for any type of linear model. The application of the methods here is illustrated via the stochastic search variable selection algorithm of George and McCulloch (1993), which is modified to utilize the new priors. The performance of the approach is illustrated with two constructed examples and a computer performance dataset.

11.
While Bayesian analogues of lasso regression have become popular, comparatively little has been said about formal treatments of model uncertainty in such settings. This paper describes methods that can be used to evaluate the posterior distribution over the space of all possible regression models for Bayesian lasso regression. Access to the model space posterior distribution is necessary if model-averaged inference (e.g., model-averaged prediction and calculation of posterior variable inclusion probabilities) is desired. The key element of all such inference is the ability to evaluate the marginal likelihood of the data under a given regression model, which has so far proved difficult for the Bayesian lasso. This paper describes how the marginal likelihood can be accurately computed when the number of predictors in the model is not too large, allowing for model space enumeration when the total number of possible predictors is modest. In cases where the total number of possible predictors is large, a simple Markov chain Monte Carlo approach for sampling the model space posterior is provided. This Gibbs sampling approach is similar in spirit to the stochastic search variable selection methods that have become one of the main tools for addressing Bayesian regression model uncertainty, and the adaptation of these methods to the Bayesian lasso is shown to be straightforward.

12.
Based on B-spline basis functions and the smoothly clipped absolute deviation (SCAD) penalty, we present a new estimation and variable selection procedure based on modal regression for partially linear additive models. The outstanding merit of the new method is that it is robust against outliers or heavy-tailed error distributions and performs no worse than least-squares-based estimation in the normal error case. The main difference is that the standard quadratic loss is replaced by a kernel function depending on a bandwidth that can be automatically selected based on the observed data. With appropriate selection of the regularization parameters, the new method possesses consistency in variable selection and the oracle property in estimation. Finally, both a simulation study and a real data analysis are performed to examine the performance of our approach.

13.
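Fang Kuangnan, Yang Yang. Statistical Research (《统计研究》), 2018, 35(8): 104-115
For classification problems, this paper proposes the sparse group lasso support vector machine (SGL-SVM), which introduces the SGL penalty into the loss function of the SVM model and can select variables both between groups and within groups at the same time. Because the objective function of SGL-SVM is relatively difficult to optimize, the paper also proposes a fast two-layer coordinate descent algorithm. Simulation experiments show that SGL-SVM outperforms other methods in both prediction and variable selection. For data in which the variables have a natural group structure and are sparse within groups, the method improves variable selection while also improving the prediction accuracy of the model. Finally, the proposed SGL-SVM method is applied to predicting the financial distress of listed manufacturing companies in China.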
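To make the group-plus-within-group selection idea concrete, the sketch below evaluates one widely used form of the sparse group lasso penalty. The exact penalty in the paper is not given in the abstract, so the weighting by the square root of the group size, the mixing parameter `alpha`, and the function name are all assumptions made for illustration.

```python
import math
from typing import Sequence


def sparse_group_lasso_penalty(
    beta: Sequence[float],
    groups: Sequence[Sequence[int]],
    lam: float,
    alpha: float,
) -> float:
    """Assumed form: lam * (alpha * ||beta||_1
                            + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2).

    beta   : coefficient vector
    groups : index lists, one per group (hypothetical encoding)
    alpha  : mixing weight in [0, 1]; 1 gives the lasso, 0 the group lasso
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # L1 term: encourages sparsity of individual coefficients within groups.
    l1_term = sum(abs(b) for b in beta)
    # Group term: encourages whole groups of coefficients to be zero.
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(beta[j] ** 2 for j in g))
        for g in groups
    )
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)
```

In an SVM setting, a penalty of this kind would be added to the hinge loss; the coordinate descent solver described in the abstract is not sketched here.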

14.
Varying-coefficient models have been widely used to investigate the possible time-dependent effects of covariates when the response variable comes from a normal distribution. Much progress has been made on inference and variable selection in the framework of such models. However, the identification of model structure, that is, how to identify which covariates have time-varying effects and which have fixed effects, remains a challenging and unsolved problem, especially when the dimension of covariates is much larger than the sample size. In this article, we consider the structural identification and variable selection problems in varying-coefficient models for high-dimensional data. Using a modified basis expansion approach and group variable selection methods, we propose a unified procedure to simultaneously identify the model structure, select important variables and estimate the coefficient curves. The unique feature of the proposed approach is that we do not have to specify the model structure in advance; it is therefore more realistic and appropriate for real data analysis. Asymptotic properties of the proposed estimators have been derived under regularity conditions. Furthermore, we evaluate the finite sample performance of the proposed methods with Monte Carlo simulation studies and a real data analysis.

15.
Variable selection is an important task in regression analysis. The performance of a statistical model depends strongly on the choice of the subset of predictors. There are several methods for selecting the most relevant variables to construct a good model. In practice, however, the dependent variable may take positive continuous values and not be normally distributed. In such situations, the gamma distribution is more suitable than the normal distribution for building a regression model. This paper introduces a heuristic approach to variable selection using artificial bee colony optimization for gamma regression models. We evaluated the proposed method against classical selection methods such as backward and stepwise selection. Both simulation studies and real data set examples demonstrated the accuracy of our selection procedure.

16.
Most longitudinal data contain influential points, and the generalized and weighted generalized estimating equations (GEEs and WGEEs) used to analyze such data are highly influenced by these points. One approach to dealing with outliers is to use weight functions. In this article, we propose some new weights based on statistical depth. These weights express the centrality of points with respect to the whole sample, with a smaller depth for points far from the center and a larger depth for points near the center. The proposed approach leads to a robust WGEE. These approaches are applied to two real datasets and evaluated in simulation studies.

17.
In this article we discuss variable selection for decision making, with a focus on decisions regarding when to provide treatment and which treatment to provide. Current variable selection techniques were developed for use in a supervised learning setting where the goal is prediction of the response. These techniques often downplay the importance of interaction variables that have small predictive ability but that are critical when the ultimate goal is decision making rather than prediction. We propose two new techniques designed specifically to find variables that aid in decision making. Simulation results are given along with an application of the methods to data from a randomized controlled trial for the treatment of depression.

18.
Biomarkers have the potential to improve our understanding of disease diagnosis and prognosis. Biomarker levels that fall below the assay detection limits (DLs), however, compromise the application of biomarkers in research and practice. Most existing methods to handle non-detects focus on a scenario in which the response variable is subject to the DL; only a few methods consider explanatory variables when dealing with DLs. We propose a Bayesian approach for generalized linear models with explanatory variables subject to lower, upper, or interval DLs. In simulation studies, we compared the proposed Bayesian approach to four commonly used methods in a logistic regression model with explanatory variable measurements subject to the DL. We also applied the Bayesian approach and the other four methods in a real study, in which a panel of cytokine biomarkers was studied for their association with acute lung injury (ALI). We found that IL8 was associated with a moderate increase in risk for ALI in the model based on the proposed Bayesian approach.

19.
20.
A multistage variable selection method is introduced for detecting association signals in structured brain-wide and genome-wide association studies (brain-GWAS). Compared to conventional methods that link one voxel to one single nucleotide polymorphism (SNP), our approach is more efficient and powerful in selecting the important signals by integrating anatomic and gene grouping structures in the brain and the genome, respectively. It avoids resorting to a large number of multiple comparisons while effectively controlling the false discoveries. Validity of the proposed approach is demonstrated by both theoretical investigation and numerical simulations. We apply our proposed method to a brain-GWAS using Alzheimer's Disease Neuroimaging Initiative positron emission tomography (ADNI PET) imaging and genomic data. We confirm previously reported association signals and also uncover several novel SNPs and genes that are either associated with brain glucose metabolism or have their association significantly modified by Alzheimer's disease status.
