Similar Documents
Found 20 similar documents (search time: 46 ms)
1.
Feature screening and variable selection are fundamental to the analysis of ultrahigh-dimensional data, which are now collected in diverse scientific fields at relatively low cost. Distance correlation-based sure independence screening (DC-SIS) has been proposed to perform feature screening for ultrahigh-dimensional data. DC-SIS possesses the sure screening property and filters out unimportant predictors in a model-free manner. Like all independence screening methods, however, it fails to detect truly important predictors that are marginally independent of the response variable because of correlations among predictors. When many irrelevant predictors are highly correlated with some strongly active predictors, independence screening may miss other active predictors with relatively weak marginal signals. To improve the performance of DC-SIS, we introduce an effective iterative procedure based on distance correlation to detect all truly important predictors, and potentially interactions, in both linear and nonlinear models. The proposed iterative method thus retains the favourable model-free and robustness properties. We further illustrate its excellent finite-sample performance through comprehensive simulation studies and an empirical analysis of the rat eye expression data set.
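The marginal ranking step that DC-SIS builds on can be sketched in a few lines of Python. This is a minimal illustration of sample distance correlation (Székely et al., 2007) and top-d screening, not the paper's full iterative procedure; the function names are our own:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double-center each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def dc_sis(X, y, d):
    """Rank predictors by distance correlation with the response; keep the top d."""
    scores = [distance_correlation(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:d]
```

An iterative version would alternate between screening on residuals and refitting, which is where the paper's contribution lies.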

2.
We consider the problem of variable screening in ultra-high-dimensional generalized linear models (GLMs) of non-polynomial order. Since the popular SIS approach is extremely unstable in the presence of contamination and noise, we discuss a new robust screening procedure based on the minimum density power divergence estimator (MDPDE) of the marginal regression coefficients. Our proposed screening procedure performs well under both pure and contaminated data scenarios. We provide a theoretical motivation for the use of marginal MDPDEs for variable screening from both the population and sample perspectives; in particular, we prove that the marginal MDPDEs are uniformly consistent, leading to the sure screening property of our proposed algorithm. Finally, we propose an appropriate MDPDE-based extension for robust conditional screening in GLMs, along with a derivation of its sure screening property. Our proposed methods are illustrated through extensive numerical studies and an interesting real data application.

3.
This paper considers the problem of variance estimation for sparse ultra-high dimensional varying coefficient models. We first use B-splines to approximate the coefficient functions and discuss the asymptotic behavior of a naive two-stage estimator of the error variance. We also reveal that this naive estimator may significantly underestimate the error variance because of spurious correlations, which are even higher for nonparametric models than for linear models. This prompts us to propose an accurate estimator of the error variance that effectively integrates the sure independence screening and refitted cross-validation techniques. The consistency and asymptotic normality of the resulting estimator are established under regularity conditions. Simulation studies are carried out to assess the finite sample performance of the proposed methods.
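As a minimal illustration of the refitted cross-validation idea, here is a sketch for a plain linear model (the paper itself works with varying coefficient models via B-splines): screen on one half of the data, refit and estimate the variance on the other half, and average. `corr_screen` is a simple marginal-correlation stand-in for the screening step:

```python
import numpy as np

def _refit_sigma2(X, y, cols):
    """OLS residual variance after refitting on the selected columns."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return resid @ resid / (len(y) - Z.shape[1])

def corr_screen(X, y, d):
    """Marginal-correlation screening used on each half."""
    r = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(r)[::-1][:d]

def rcv_variance(X, y, d):
    """Refitted cross-validation: variables selected on one half are
    refitted on the other half, so spurious correlations picked up during
    screening do not deflate the residual variance."""
    n = X.shape[0]
    i1, i2 = np.arange(n // 2), np.arange(n // 2, n)
    s1 = corr_screen(X[i1], y[i1], d)
    s2 = corr_screen(X[i2], y[i2], d)
    return 0.5 * (_refit_sigma2(X[i2], y[i2], s1) + _refit_sigma2(X[i1], y[i1], s2))
```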

4.
The varying-coefficient model is an important nonparametric statistical model because it allows appreciable flexibility in the structure of the fitted model. For ultra-high dimensional heterogeneous data, it is essential to examine how the effects of covariates vary with exposure variables at the quantile levels of interest. In this paper, we extend marginal screening methods to examine and select variables by ranking a measure of the nonparametric marginal contribution of each covariate given the exposure variable. Spline approximations are employed to model the marginal effects and to select the set of active variables in a quantile-adaptive framework. This ensures the sure screening property for the quantile-adaptive varying-coefficient model. Numerical studies demonstrate that the proposed procedure works well for heteroscedastic data.

5.
Generalized additive mixed models are proposed for overdispersed and correlated data, which arise frequently in studies involving clustered, hierarchical and spatial designs. This class of models allows flexible functional dependence of an outcome variable on covariates through nonparametric regression, while accounting for correlation between observations through random effects. We estimate the nonparametric functions using smoothing splines and jointly estimate the smoothing parameters and variance components using marginal quasi-likelihood. Because maximizing the objective function often requires numerical integration, double penalized quasi-likelihood is proposed for approximate inference. Frequentist and Bayesian inferences are compared. A key feature of the proposed method is that it allows systematic inference on all model components within a unified parametric mixed model framework, and it can easily be implemented by fitting a working generalized linear mixed model with existing statistical software. A bias correction procedure is also proposed to improve the performance of double penalized quasi-likelihood for sparse data. We illustrate the method with an application to infectious disease data and evaluate its performance through simulation.

6.
We introduce a two-step procedure, in the context of ultra-high dimensional additive models, that aims to reduce the dimension of the covariate vector and to distinguish linear from nonlinear effects among the nonzero components. In the first step, our proposed screening procedure is constructed from the cumulative distribution function and the conditional expectation of the response within a marginal correlation framework; B-splines and empirical distribution functions are used to estimate these two measures. The sure screening property of this procedure is also established. In the second step, a double-penalization-based procedure is applied to identify nonzero and linear components simultaneously. The performance of the method is examined on several test functions to show its capabilities against competing methods under varying error distributions. Simulation studies imply that the proposed screening procedure can be applied to ultra-high dimensional data and detects the influential covariates well; it also demonstrates superiority over existing methods. The method is applied to identify the genes most influential for overexpression of a G protein-coupled receptor in mice.

7.
In this article, a new model-free feature screening method based on a probability density (mass) function distance (PDFD) correlation is presented for ultrahigh-dimensional data analysis. We improve the fused-Kolmogorov filter (F-KOL) screening procedure through the probability density distribution. The proposed method is fully nonparametric and can be applied to general types of predictors and responses, including discrete and continuous random variables. Kernel density estimation and numerical integration are applied to obtain the proposed estimator. Simulation studies indicate that the fused-PDFD performs better than existing screening methods such as the F-KOL filter, sure independence screening (SIS), sure independent ranking and screening (SIRS), distance correlation sure independence screening (DCSIS) and robust rank correlation screening (RRCS). Finally, we demonstrate the validity of the fused-PDFD with a real data example.

8.
In practice, the presence of influential observations may lead to misleading results in variable screening problems. We therefore propose a robust variable screening procedure for high-dimensional data analysis. Our method consists of two steps. The first step defines a new high-dimensional influence measure and proposes a novel influence diagnostic procedure to remove unusual observations. The second step applies the sure independence screening procedure based on distance correlation to select important variables in high-dimensional regression analysis. The new influence measure and diagnostic procedure are model free. To confirm the effectiveness of the proposed method, we conduct simulation studies and a real-life data analysis to illustrate its merits over competing methods. Both the simulation results and the real-life data analysis demonstrate that the proposed method can control the adverse effects of unusual observations after detecting and removing them, and that it outperforms the competing methods.

9.
In the era of Big Data, extracting the most important explanatory variables from ultrahigh-dimensional data plays a key role in scientific research. Existing work has mainly focused on using the extracted explanatory variables to describe the central tendency of the response. For a response variable, however, its variability is as important as its central tendency in statistical inference. This paper focuses on variability and proposes a new model-free feature screening approach: sure explained variability and independence screening (SEVIS). The core of SEVIS is to take advantage of recently proposed asymmetric, nonlinear generalised measures of correlation in the screening. Under mild conditions, we show that SEVIS not only possesses the desired sure screening and ranking consistency properties, but is also a computationally convenient variable selection method for ultrahigh-dimensional data sets with more features than observations. The superior performance of SEVIS over existing model-free methods is illustrated in extensive simulations. A real example in ultrahigh-dimensional variable selection demonstrates that the variables selected by SEVIS better explain not only the response but also the variables selected by other methods.

10.
Single index models are natural extensions of linear models and overcome the so-called curse of dimensionality; they are very useful for longitudinal data analysis. In this paper, we develop a new efficient estimation procedure for single index models with longitudinal data, based on the Cholesky decomposition and local linear smoothing. Asymptotic normality is established for the proposed estimators of both the parametric and nonparametric parts. Monte Carlo simulation studies show excellent finite sample performance, and we illustrate our methods with a real data example.

11.
Quantile regression is a flexible approach to assessing covariate effects on failure time and has attracted considerable interest in survival analysis. When the dimension of the covariates is much larger than the sample size, feature screening and variable selection become extremely important and indispensable. In this article, we introduce a new feature screening method for ultrahigh dimensional censored quantile regression. The proposed method works for a general class of survival models, allows for heterogeneity in the data, and enjoys desirable properties including the sure screening property and the ranking consistency property. Moreover, an iterative version of the screening algorithm is proposed to accommodate more complex situations. Monte Carlo simulation studies are designed to evaluate the finite sample performance under different model settings, and we also illustrate the proposed methods through an empirical analysis.

12.
Fan J, Lv J. Statistica Sinica 2010, 20(1): 101-148
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discovery. The traditional idea of best subset selection, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality, and have been widely applied to simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of recent developments in the theory, methods, and implementation of high dimensional variable selection. Advances in the field are rapidly driven by questions of what limits of dimensionality such methods can handle, what role penalty functions play, and what statistical properties the methods possess. The properties of non-concave penalized likelihood and its role in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.
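The independence screening idea reviewed here can be sketched in a few lines. This is the basic marginal-correlation SIS of Fan and Lv (2008), with the commonly used submodel size of about n/log n:

```python
import numpy as np

def sis(X, y, d=None):
    """Sure independence screening: keep the d predictors with the
    largest absolute marginal correlation with the response."""
    n = X.shape[0]
    if d is None:
        d = int(n / np.log(n))        # commonly used submodel size
    Xs = (X - X.mean(0)) / X.std(0)   # standardize columns
    ys = (y - y.mean()) / y.std()
    omega = np.abs(Xs.T @ ys) / n     # componentwise marginal correlations
    return np.sort(np.argsort(omega)[::-1][:d])
```

The retained submodel is then small enough to be handled by a penalized method such as SCAD or the LASSO, which is the "two-scale" strategy the review emphasizes.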

13.
Graphical models capture the conditional independence structure among random variables through the presence or absence of edges between vertices. One way of inferring a graph is to identify zero partial correlation coefficients, which is an effective way of finding conditional independence in the multivariate Gaussian setting. For more general settings, we propose kernel partial correlation, which extends partial correlation through a combination of two kernel methods. First, nonparametric function estimation is employed to remove the effects of the other variables; then the dependence between the remaining random components is assessed through a nonparametric association measure. The proposed approach is not only flexible but also robust under high levels of noise, owing to the robustness of the nonparametric components.
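A minimal sketch of the two-step construction, assuming a Nadaraya-Watson smoother for the removal step and Spearman correlation as the association measure (the paper's exact kernel choices may differ):

```python
import numpy as np

def nw_residuals(z, x, bandwidth=0.5):
    """Residuals of x after Nadaraya-Watson regression on z (Gaussian kernel)."""
    w = np.exp(-0.5 * ((z[:, None] - z[None, :]) / bandwidth) ** 2)
    fitted = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
    return x - fitted

def spearman(u, v):
    """Rank (Spearman) correlation, a robust association measure."""
    ru = np.argsort(np.argsort(u)).astype(float)
    rv = np.argsort(np.argsort(v)).astype(float)
    ru -= ru.mean(); rv -= rv.mean()
    return (ru @ rv) / np.sqrt((ru @ ru) * (rv @ rv))

def kernel_partial_corr(x, y, z, bandwidth=0.5):
    """Step 1: nonparametrically remove the effect of z from x and y.
    Step 2: measure the association between the residual components."""
    return spearman(nw_residuals(z, x, bandwidth), nw_residuals(z, y, bandwidth))
```

Values near zero suggest conditional independence of x and y given z, and hence no edge between the corresponding vertices.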

14.
In this paper, we consider sure independence feature screening for ultrahigh dimensional discriminant analysis. We propose a new method, robust rank screening, based on the conditional expectation of the ranks of a predictor's samples. We also establish the sure screening property for the proposed procedure under simple assumptions. The new procedure has several additional desirable characteristics. First, it is robust against heavy-tailed distributions, potential outliers and sample shortages in some categories. Second, it is model-free, requiring no specification of a regression model, and is directly applicable to settings with many categories. Third, its theoretical derivation is simple because the resulting statistics are bounded. Fourth, it is computationally inexpensive because of the simple structure of the screening index. Monte Carlo simulations and real data examples are used to demonstrate the finite sample performance.
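One plausible reading of a rank-based screening index is sketched below: score each predictor by how much its class-conditional mean rank varies across categories. This is only an illustration of the idea under our own assumptions, not necessarily the paper's exact statistic:

```python
import numpy as np

def rank_screen_score(x, labels):
    """Variability of the class-conditional mean rank of x.
    Ranks are bounded, which gives robustness to heavy tails and outliers."""
    n = len(x)
    r = np.argsort(np.argsort(x)) / n          # ranks scaled to [0, 1)
    score = 0.0
    for k in np.unique(labels):
        mask = labels == k
        score += mask.mean() * (r[mask].mean() - r.mean()) ** 2
    return score

def rank_screen(X, labels, d):
    """Keep the d predictors whose mean rank differs most across categories."""
    scores = np.array([rank_screen_score(X[:, j], labels)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]
```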

15.
Ultra-high dimensional data arise in many fields of modern science, such as medical science, economics, genomics and image processing, and pose unprecedented challenges for statistical analysis. With the rapid growth of scientific data across disciplines, feature screening becomes a primary step for reducing the high dimensionality to a moderate scale that can be handled by existing penalized methods. In this paper, we introduce a simple and robust feature screening method, free of model assumptions, for high dimensional censored data. The proposed method is model-free and hence applicable to a general class of survival models. The sure screening and ranking consistency properties are established without any finite moment condition on the predictors or the response. The computation of the proposed method is straightforward. Finite sample performance is examined via extensive simulation studies, and an application is illustrated with a gene association study of mantle cell lymphoma.

16.
Symmetric kernel smoothing is commonly used to estimate the nonparametric component of partial linear regression models. In this article, we propose a new estimation method for partial linear regression models using the inverse Gaussian kernel when the explanatory variable of the nonparametric component is non-negatively supported. As an asymmetric kernel function, the inverse Gaussian kernel is likewise supported on the non-negative half line. The asymptotic properties of the proposed estimators, including asymptotic normality, uniform almost sure convergence, and laws of the iterated logarithm, are thoroughly discussed for both homoscedastic and heteroscedastic cases. A simulation study is conducted to evaluate the finite sample performance of the proposed estimators.

17.
The presence of outliers inevitably leads to distorted analyses and inappropriate predictions, especially with multiple outliers in high-dimensional regression, where the high dimensionality of the data may amplify the chance of one or more observations being outlying. Since the detection of outliers is both necessary and important in high-dimensional regression analysis, in this paper we propose a feasible outlier detection approach for the sparse high-dimensional linear regression model. First, we search for a clean subset using the sure independence screening method and least trimmed squares regression estimates. Then, we define a high-dimensional outlier detection measure and propose a multiple-outlier detection approach through multiple testing procedures. In addition, to enhance efficiency, we refine the detection rule after obtaining a relatively reliable non-outlier subset from the initial approach. Monte Carlo comparison studies show that the proposed method performs well for detecting multiple outliers in the sparse high-dimensional linear regression model. We further illustrate its application through an empirical analysis of real-life protein and gene expression data.
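The two-stage logic (find a clean subset, then flag points by residual) can be sketched in low dimensions as follows. The trimmed refit here is a crude stand-in for least trimmed squares, and a MAD scale estimate with a fixed cutoff replaces the paper's formal multiple testing step:

```python
import numpy as np

def flag_outliers(X, y, clean_frac=0.5, z_cut=3.0):
    """Fit OLS, refit on the clean_frac fraction of points with the smallest
    absolute residuals, then flag points whose robustly standardized
    residuals exceed z_cut."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    keep = np.argsort(np.abs(resid))[: int(clean_frac * n)]   # clean subset
    beta, *_ = np.linalg.lstsq(Z[keep], y[keep], rcond=None)  # robust refit
    resid = y - Z @ beta
    sigma = 1.4826 * np.median(np.abs(resid))   # MAD scale estimate
    return np.where(np.abs(resid) / sigma > z_cut)[0]
```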

18.
Liu X, Wang L, Liang H. Statistica Sinica 2011, 21(3): 1225-1248
Semiparametric additive partial linear models, containing both linear and nonlinear additive components, are more flexible than linear models and more efficient than general nonparametric regression models because they mitigate the "curse of dimensionality". In this paper, we propose a new estimation approach for these models in which polynomial splines approximate the additive nonparametric components, and we derive the asymptotic normality of the resulting parameter estimators. We also develop a variable selection procedure to identify significant linear components using the smoothly clipped absolute deviation (SCAD) penalty, and we show that the SCAD-based estimators of the non-zero linear components have an oracle property. Simulations compare the performance of our approach with several other variable selection methods, such as the Bayesian Information Criterion and the Least Absolute Shrinkage and Selection Operator (LASSO). The proposed approach is also applied to real data from a nutritional epidemiology study, in which we explore the relationship between plasma beta-carotene levels and personal characteristics (e.g., age, gender, body mass index (BMI)) as well as dietary factors (e.g., alcohol consumption, smoking status, cholesterol intake).

19.
High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by applications in high-throughput genomic data analysis. In this paper, we propose a class of regularization methods, integrating both the penalized empirical likelihood and pseudoscore approaches, for variable selection and estimation in sparse, high-dimensional additive hazards regression models. When the number of covariates grows with the sample size, we establish asymptotic properties of the resulting estimator and the oracle property of the proposed method. The proposed estimator is shown to be more efficient than that obtained from the non-concave penalized likelihood approach in the literature. Based on a penalized empirical likelihood ratio statistic, we further develop a nonparametric likelihood approach for testing linear hypotheses about the regression coefficients and, consequently, for constructing confidence regions. Simulation studies are carried out to evaluate the performance of the proposed methodology, and two real data sets are analyzed.

20.
Variable selection for multivariate nonparametric regression is an important yet challenging problem, due in part to the infinite dimensionality of the function space. An ideal selection procedure should be automatic, stable, easy to use, and have desirable asymptotic properties. In particular, we call a selection procedure nonparametric oracle (np-oracle) if it consistently selects the correct subset of predictors and, at the same time, estimates the smooth surface at the optimal nonparametric rate as the sample size goes to infinity. In this paper, we propose a model selection procedure for nonparametric models and explore the conditions under which the new method enjoys these properties. Developed in the framework of smoothing spline ANOVA, our estimator is obtained by solving a regularization problem with a novel adaptive penalty on the sum of the functional component norms. Theoretical properties of the new estimator are established, and numerous simulated and real examples demonstrate that the new approach substantially outperforms existing methods in finite samples.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号