Similar Articles (20 results)
1.
Graphs and networks are common ways of depicting information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This kind of a priori use of graphs is a useful supplement to standard numerical data such as microarray gene expression data. In this paper, we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graph-constrained regularization procedure and its theoretical properties for regression analysis that takes into account the neighborhood information of the variables measured on a graph, where a smoothness penalty on the coefficients is defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real dataset that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.
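A rough sketch of this objective (our notation, not necessarily the authors' exact formulation): with adjacency matrix $A$, degree matrix $D$ and graph Laplacian $L = D - A$, a graph-constrained estimator of this type solves

$\hat{\beta} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2\, \beta^{\top} L \beta,$

where the $\ell_1$ term performs variable selection and the quadratic form $\beta^{\top} L \beta = \sum_{u \sim v} (\beta_u - \beta_v)^2$ shrinks the coefficients of covariates that are neighbors on the graph toward each other.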

2.
Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term is open to criticism. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion that involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g. a stepwise method). Second, the algorithms are often computationally expensive because they need multiple calls to EM algorithms. Here we propose to use a new information criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood estimate, and its maximization turns out to be simple and computationally efficient. The original contribution of our approach is to perform the model selection without requiring any parameter estimation; parameter inference is then needed only for the single selected model. This approach is used for variable selection in a Gaussian mixture model with conditional independence assumed. Numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches to variable selection. The proposed approach is implemented in the R package VarSelLCM, available on CRAN.
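As a hedged sketch of the kind of criterion involved: for a candidate model $m$ (here, a split of the variables into clustering-relevant and irrelevant ones), an integrated complete-data likelihood criterion takes roughly the form

$\mathrm{MICL}(m) = \max_{z} \log p(\mathbf{x}, z \mid m),$

where $z$ is a cluster assignment and the parameters have been integrated out of the complete-data likelihood (available in closed form under conjugate priors and conditional independence), so the criterion can be optimized over $(m, z)$ without running EM.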

3.
Variable selection for nonlinear regression is a complex problem, made even more difficult when there are a large number of potential covariates and a limited number of data points. We propose herein a multi-stage method that combines state-of-the-art techniques at each stage to best discover the relevant variables. At the first stage, an extension of Bayesian additive regression trees is adopted to reduce the total number of variables to around 30. At the second stage, sensitivity analysis in the treed Gaussian process is adopted to further reduce the total number of variables. Two stopping rules are designed, and sequential design is adopted to make the best use of previous information. We demonstrate our approach on two simulated examples and one real data set.

4.
Calibration techniques in survey sampling, such as generalized regression estimation (GREG), were formalized in the 1990s to produce efficient estimators of linear combinations of study variables, such as totals or means. They implicitly rely on the assumption of a linear regression model between the variable of interest and some auxiliary variables, yielding estimates with lower variance if the model is true while remaining approximately design-unbiased even if the model does not hold. We propose a new class of model-assisted estimators obtained by releasing a few calibration constraints and replacing them with a penalty term added to the distance criterion to be minimized. By introducing the concept of penalized calibration, which combines usual calibration and this ‘relaxed’ calibration, we are able to adjust the weight given to the available auxiliary information. We obtain a more flexible estimation procedure that gives better estimates, particularly when the auxiliary information is overly abundant or not fully appropriate to be used in its entirety. Such an approach can also be seen as a design-based alternative to estimation procedures based on the more general class of mixed models, opening new prospects in some areas of application such as inference on small domains.
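One plausible form of the penalized criterion (our notation): with design weights $d_k$ and relaxed constraints indexed by $J_r$, the calibrated weights solve

$\min_{w} \; \sum_{k \in s} \frac{(w_k - d_k)^2}{d_k q_k} + \lambda \sum_{j \in J_r} \Big( \sum_{k \in s} w_k x_{jk} - t_{x_j} \Big)^2,$

subject to exact calibration on the remaining constraints; $\lambda \to \infty$ recovers full calibration on the relaxed totals, while $\lambda = 0$ drops them entirely.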

5.
We propose two new procedures based on multiple hypothesis testing for correct support estimation in high‐dimensional sparse linear models. We prove that both procedures are powerful and do not require the sample size to be large. The first procedure tackles the atypical setting of ordered variable selection through an extension of a testing procedure previously developed in the context of a linear hypothesis. The second procedure, the main contribution of this paper, enables data analysts to perform support estimation in the general high‐dimensional framework of non‐ordered variable selection. A thorough simulation study and applications to real datasets using the R package mht show that our non‐ordered variable selection procedure produces excellent results in terms of correct support estimation as well as in terms of mean square error and false discovery rate, when compared to common methods such as the Lasso, the SCAD penalty, forward regression and the false discovery rate (FDR) procedure.

6.
In this article, we generalize partially linear single-index models to the scenario with some endogenous covariates. It is well known that estimators based on existing methods are often inconsistent because of the endogeneity of the covariates. To deal with the endogenous variables, we introduce auxiliary instrumental variables. A three-stage estimation procedure is proposed for partially linear single-index instrumental-variables models: the first stage obtains a linear projection of the endogenous variables on a set of instrumental variables, the second stage estimates the link function using a local linear smoother for given constant parameters, and the last stage obtains the estimators of the constant parameters from the estimating equation. Asymptotic normality is established for the proposed estimators. Simulation studies are undertaken to assess the finite-sample performance of the proposed estimation procedure.
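For concreteness, such a model can be written in generic notation as

$Y = Z^{\top}\beta + g(X^{\top}\alpha) + \varepsilon,$

where some components of $Z$ are endogenous, i.e. $E[\varepsilon \mid Z] \neq 0$; the first stage replaces those components by their linear projection $\hat{Z} = W(W^{\top}W)^{-1}W^{\top}Z$ on the instruments $W$, restoring the exogeneity needed for the later stages.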

7.
8.
Mixed model selection is quite important in the statistical literature. To assist with it, we employ an adaptive LASSO penalty to propose a two-stage procedure for choosing both the random and the fixed effects. In the first stage, we use the penalized restricted profile log-likelihood to choose the random effects; in the second stage, after the random effects are determined, we apply the penalized profile log-likelihood to select the fixed effects. In each stage, the Newton–Raphson algorithm is used to carry out the parameter estimation. We prove that the proposed procedure is consistent and possesses the oracle properties. Simulations and a real data application demonstrate the effectiveness of the proposed selection procedure.
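Schematically (our notation), each stage maximizes a criterion of the form

$\ell(\psi) - n\lambda \sum_{j} \frac{|\psi_j|}{|\tilde{\psi}_j|},$

where $\ell$ is the restricted profile log-likelihood in the first stage (with $\psi$ the random-effects parameters) and the profile log-likelihood in the second stage (with $\psi$ the fixed effects), and $\tilde{\psi}$ is a consistent initial estimate; these data-dependent weights are what give the adaptive LASSO its oracle behavior.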

9.
The Lasso has sparked interest in the use of penalization of the log‐likelihood for variable selection, as well as for shrinkage. We are particularly interested in the more‐variables‐than‐observations case of characteristic importance for modern data. The Bayesian interpretation of the Lasso as the maximum a posteriori estimate of the regression coefficients, which have been given independent, double exponential prior distributions, is adopted. Generalizing this prior provides a family of hyper‐Lasso penalty functions, which includes the quasi‐Cauchy distribution of Johnstone and Silverman as a special case. The properties of this approach, including the oracle property, are explored, and an EM algorithm for inference in regression problems is described. The posterior is multi‐modal, and we suggest a strategy of using a set of perfectly fitting random starting values to explore modes in different regions of the parameter space. Simulations show that our procedure provides significant improvements on a range of established procedures, and we provide an example from chemometrics.
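The Bayesian interpretation referred to here is standard: with independent double exponential priors $\pi(\beta_j) = \frac{\lambda}{2} e^{-\lambda |\beta_j|}$, the posterior mode solves

$\hat{\beta} = \arg\max_{\beta} \; \log L(\beta) - \lambda \sum_{j} |\beta_j|,$

which is exactly the Lasso objective; replacing the double exponential by heavier-tailed generalizations yields the hyper-Lasso family of penalties.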

10.
We study estimation and inference when there are multiple values (“matches”) for the explanatory variables and only one of the matches is the correct one. This problem arises often when two datasets are linked together on the basis of information that does not uniquely identify regressor values. We offer a set of two intuitive conditions that ensure consistent inference using the average of the possible matches in a linear framework. The first condition is the exogeneity of the false match with respect to the regression error. The second condition is a notion of exchangeability between the true and false matches. Conditioning on the observed data, the probability that each match is correct is completely unrestricted. We perform a Monte Carlo study to investigate the estimator’s finite-sample performance relative to others proposed in the literature. Finally, we provide an empirical example revisiting a main area of application: the measurement of intergenerational elasticities in income. Supplementary materials for this article are available online.
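A minimal simulation sketch of the averaged-matches idea (hypothetical data and variable names, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, beta = 500, 3, 2.0
    x_true = rng.normal(size=n)              # the correct match
    x_false = rng.normal(size=(n, m - 1))    # false matches: exogenous and drawn
                                             # from the same distribution (exchangeable)
    y = beta * x_true + rng.normal(size=n)
    x_bar = np.column_stack([x_true, x_false]).mean(axis=1)   # average over matches
    X = np.column_stack([np.ones(n), x_bar])
    print(np.linalg.lstsq(X, y, rcond=None)[0])   # slope estimate is close to beta

In this stylized setup the attenuation from averaging cancels and ordinary least squares on the match average recovers the slope; the paper's two conditions formalize when this happens.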

11.
The different parts (variables) of a compositional data set cannot be considered independent from each other, since only the ratios between the parts constitute the relevant information to be analysed. Practically, this information can be included in a system of orthonormal coordinates. For the task of regression of one part on other parts, a specific choice of orthonormal coordinates is proposed which allows for an interpretation of the regression parameters in terms of the original parts. In this context, orthogonal regression is appropriate since all compositional parts – also the explanatory variables – are measured with errors. Besides classical (least-squares based) parameter estimation, also robust estimation based on robust principal component analysis is employed. Statistical inference for the regression parameters is obtained by bootstrap; in the robust version the fast and robust bootstrap procedure is used. The methodology is illustrated with a data set from macroeconomics.
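One standard choice of such coordinates (pivot coordinates; the paper's specific choice may differ in detail) maps a composition $x = (x_1, \dots, x_D)$ to

$z_i = \sqrt{\frac{D-i}{D-i+1}}\, \ln \frac{x_i}{\big(\prod_{j=i+1}^{D} x_j\big)^{1/(D-i)}}, \quad i = 1, \dots, D-1,$

so that the first coordinate $z_1$ carries all the relative information about the part $x_1$, which is what allows the regression parameters to be interpreted in terms of the original parts.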

12.
We study estimation and variable selection for a partial linear single-index model (PLSIM) when some linear covariates are not observed but their ancillary variables are available. We use a semiparametric profile least-squares estimation procedure to estimate the parameters in the PLSIM after the calibrated error-prone covariates are obtained. Asymptotic normality of the estimators is established. We also employ the smoothly clipped absolute deviation (SCAD) penalty to select the relevant variables in the PLSIM. The resulting SCAD estimators are shown to be asymptotically normal and to have the oracle property. The performance of our estimation procedure is illustrated through numerous simulations. The approach is further applied to a real data example.
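For reference, the SCAD penalty used here is defined through its derivative

$p'_{\lambda}(\theta) = \lambda \Big\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda) \Big\}, \quad \theta > 0,$

with $a > 2$ (conventionally $a = 3.7$): it penalizes like the Lasso near zero but flattens out for large coefficients, which is what yields the oracle property.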

13.
Often the variables in a regression model are difficult or expensive to obtain, so auxiliary variables are collected in a preliminary step of a study and the model variables are measured at later stages on only a subsample of the study participants, called the validation sample. We consider a study in which at the first stage some variables, called auxiliaries throughout, are collected; at the second stage the true outcome is measured on a subsample of the first-stage sample; and at the third stage the true covariates are collected on a subset of the second-stage sample. In order to increase efficiency, the probabilities of selection into the second- and third-stage samples are allowed to depend on the data observed at the previous stages. In this paper we describe a class of inverse-probability-of-selection-weighted semiparametric estimators for the parameters of the model for the conditional mean of the outcomes given the covariates. We assume that a subject's probability of being sampled at subsequent stages is bounded away from zero and depends only on the subject's data collected at the previous sampling stages. We show that the asymptotic variance of the optimal estimator in our class is equal to the semiparametric variance bound for the model. Since the optimal estimator depends on unknown population parameters, it is not available for data analysis. We therefore propose an adaptive estimation procedure for locally efficient inference. A simulation study is carried out to examine the finite-sample properties of the proposed estimators.
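The weighting can be sketched generically: if $R_i$ indicates that subject $i$ is fully observed and $\pi_i$ is its cumulative selection probability given the earlier-stage data, an inverse-probability-of-selection-weighted estimating equation has the form

$\sum_{i=1}^{n} \frac{R_i}{\pi_i}\, U(Y_i, X_i; \beta) = 0,$

where $U$ is the complete-data estimating function for the conditional mean model; requiring $\pi_i$ to be bounded away from zero keeps the weights, and hence the variance, finite.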

14.
We propose a new class of two-stage parameter estimation methods for semiparametric ordinary differential equation (ODE) models. In the first stage, state variables are estimated using a penalized spline approach; in the second stage, the form of a numerical discretization algorithm for an ODE solver is used to formulate estimating equations. The state variables estimated in the first stage are used to obtain more data points for the second stage. Asymptotic properties of the proposed estimators are established. Simulation studies show that the method performs well, especially for small samples. A real-life use of the method is illustrated with an influenza-specific cell-trafficking study.
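A hedged sketch of the two stages for an ODE model $x'(t) = f(x(t), \theta)$: stage one smooths the data to obtain $\hat{x}(t)$; stage two, using for instance a one-step Euler discretization on a grid $t_0 < t_1 < \cdots$, minimizes

$\sum_{i} \big\| \hat{x}(t_{i+1}) - \hat{x}(t_i) - (t_{i+1} - t_i)\, f(\hat{x}(t_i), \theta) \big\|^2$

over $\theta$, so no numerical ODE solver has to be run inside the optimization; higher-order discretizations (e.g. Runge–Kutta) fit the same template.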

15.
This paper considers variable and factor selection in factor analysis. We treat the factor loadings for each observable variable as a group, and introduce a weighted sparse group lasso penalty to the complete log-likelihood. The proposal simultaneously selects observable variables and latent factors of a factor analysis model in a data-driven fashion; it produces a more flexible and sparse factor loading structure than existing methods. For parameter estimation, we derive an expectation-maximization algorithm that optimizes the penalized log-likelihood. The tuning parameters of the procedure are selected by a likelihood cross-validation criterion that yields satisfactory results in various simulation settings. Simulation results reveal that the proposed method can better identify the possibly sparse structure of the true factor loading matrix with higher estimation accuracy than existing methods. A real data example is also presented to demonstrate its performance in practice.
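One plausible form of the penalized criterion (our notation, not necessarily the authors' exact formulation): with $\Lambda_j$ the vector of loadings of observable variable $j$ on the latent factors, the penalized complete log-likelihood is roughly

$\ell(\Lambda, \Psi) - \lambda_1 \sum_{j=1}^{p} w_j \|\Lambda_j\|_2 - \lambda_2 \sum_{j,k} |\lambda_{jk}|,$

where the weighted $\ell_2$ group term can zero out an entire row of the loading matrix (removing variable $j$) while the $\ell_1$ term zeroes individual loadings (removing single variable-factor links).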

16.
The author introduces robust techniques for estimation, inference and variable selection in the analysis of longitudinal data. She first addresses the problem of the robust estimation of the regression and nuisance parameters, for which she derives the asymptotic distribution. She uses weighted estimating equations to build robust quasi‐likelihood functions. These functions are then used to construct a class of test statistics for variable selection. She derives the limiting distribution of these tests and shows its robustness properties in terms of stability of the asymptotic level and power under contamination. An application to a real data set allows her to illustrate the benefits of a robust analysis.
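Schematically (one common construction of such equations, not necessarily the author's exact form), a robust weighted estimating equation looks like

$\sum_{i=1}^{n} \frac{\partial \mu_i^{\top}}{\partial \beta}\, V_i^{-1}\, \psi_c(r_i)\, w(x_i) = 0,$

where $r_i$ are standardized residuals, $\psi_c$ is a bounded score function (e.g. Huber's) that caps the influence of outlying responses, and $w(x_i)$ downweights high-leverage covariates; taking $\psi_c(r) = r$ and $w \equiv 1$ recovers ordinary generalized estimating equations.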

17.
Regularization methods for simultaneous variable selection and coefficient estimation have been shown to be effective in quantile regression in improving prediction accuracy. In this article, we propose the Bayesian bridge for variable selection and coefficient estimation in quantile regression. A simple and efficient Gibbs sampling algorithm is developed for posterior inference using a scale mixture of uniforms representation of the Bayesian bridge prior. This is the first work to discuss regularized quantile regression with the bridge penalty. Both simulated and real data examples show that the proposed method often outperforms quantile regression without regularization, lasso quantile regression, and Bayesian lasso quantile regression.
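To fix notation (a hedged sketch): with check loss $\rho_{\tau}(u) = u(\tau - I(u < 0))$ and a working asymmetric Laplace likelihood, the Bayesian bridge places the prior

$\pi(\beta_j) \propto \exp(-\lambda |\beta_j|^{\alpha})$

on each coefficient (with $0 < \alpha < 1$ the genuinely non-convex bridge case), so the posterior mode corresponds to $\min_{\beta} \sum_i \rho_{\tau}(y_i - x_i^{\top}\beta) + \lambda \sum_j |\beta_j|^{\alpha}$; the scale mixture of uniforms representation is what makes the Gibbs sampler tractable.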

18.
A nonparametric inference algorithm developed by Davis and Geman (1983) is extended and applied to a medical prediction problem. The algorithm employs an estimation procedure for acquiring pairwise statistics among variables of a binary data set, allows for the data-driven creation of interaction terms among the variables, and employs a decision rule which asymptotically gives the minimum expected error. The inference procedure was designed for large data sets but has been extended via the method of cross-validation to encompass smaller data sets.

19.
In many regression problems, predictors are naturally grouped. For example, when a set of dummy variables is used to represent a categorical variable, or a set of basis functions of a continuous variable is included in the predictor set, it is important to carry out feature selection at the group level and at the individual-variable level within groups simultaneously. To incorporate the group and within-group information into regularized model fitting, several regularization methods have been developed, including for Cox regression and conditional mean regression. Complementary to these earlier works, we examine simultaneous group and within-group variable selection in quantile regression. We propose a hierarchically penalized quantile regression and show that the hierarchical penalty possesses the oracle property in quantile regression, as well as in Cox regression. The proposed method is evaluated through simulation studies and a real data application.
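One common construction of such a hierarchical penalty (our notation, not necessarily the authors' exact form): decompose each coefficient as $\beta_{jk} = \gamma_j \theta_{jk}$ with a shared group-level factor $\gamma_j \ge 0$, and solve

$\min_{\gamma, \theta} \; \sum_{i} \rho_{\tau}(y_i - x_i^{\top}\beta) + \lambda_{\gamma} \sum_{j} \gamma_j + \lambda_{\theta} \sum_{j,k} |\theta_{jk}|,$

so that $\gamma_j = 0$ removes group $j$ entirely while $\theta_{jk} = 0$ removes a single variable within a retained group.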

20.
The rapid development of computing has greatly eased data acquisition and storage. Many enterprises have accumulated large amounts of data, and at the same time the dimensionality of the data keeps growing and noise variables become ever more numerous, so one of the key problems in modeling and analysis is to screen a small number of important variables out of a high-dimensional set. For proportion data taking values in the interval (0,1), we propose regularized Beta regression and study penalized maximum likelihood estimation and its asymptotic properties under three penalties: LASSO, SCAD and MCP. Statistical simulations show that MCP outperforms SCAD and LASSO, and that as the sample size grows, SCAD in turn outperforms LASSO. Finally, the method is applied to a study of the factors influencing the dividend yields of Chinese listed companies.
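For reference, the MCP penalty found to perform best here is

$p_{\lambda}(\theta) = \begin{cases} \lambda\theta - \theta^2/(2\gamma), & 0 \le \theta \le \gamma\lambda, \\ \gamma\lambda^2/2, & \theta > \gamma\lambda, \end{cases}$

with $\gamma > 1$: it applies the Lasso rate $\lambda$ near zero and no penalization beyond $\gamma\lambda$, which reduces the bias that the LASSO incurs on large coefficients.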
