Similar articles
20 similar articles found.
1.
In this paper, we propose a conditional quantile independence screening approach for ultra-high-dimensional heterogeneous data given some known, significant and low-dimensional variables. The new method does not require imposing a specific model structure on the response and covariates, and it can detect additional features that contribute to conditional quantiles of the response given the already-identified important predictors. We also prove that the proposed procedure enjoys the ranking consistency and sure screening properties. Simulation studies are carried out to examine the performance of the proposed procedure. Finally, we illustrate it with a real data example.
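Several abstracts in this listing refer to the sure screening and ranking consistency properties. As a rough orientation only, these properties are usually stated in the generic form below. The notation is an assumption introduced here for illustration and is not taken from any of the listed papers: $\mathcal{A}$ is the set of truly active predictors, $\widehat{\omega}_j$ is an estimated marginal utility for predictor $j$, and $\gamma_n$ is a threshold.

```latex
\[
\widehat{\mathcal{A}} = \{\, j : \widehat{\omega}_j \ge \gamma_n \,\}, \qquad
P\bigl(\mathcal{A} \subseteq \widehat{\mathcal{A}}\bigr) \to 1
\quad \text{as } n \to \infty \quad \text{(sure screening)},
\]
\[
\liminf_{n \to \infty}
\Bigl\{ \min_{j \in \mathcal{A}} \widehat{\omega}_j
      - \max_{j \notin \mathcal{A}} \widehat{\omega}_j \Bigr\} > 0
\quad \text{almost surely (ranking consistency)}.
\]
```

The exact conditions, utilities and thresholds differ from paper to paper.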

2.
This paper is concerned with stable feature screening for ultrahigh dimensional data. To deal with the ultrahigh dimensional data problem and screen the important features, a set-averaging measurement is proposed. The model averaging technique and the conditional quantile method are used to construct a weighted set-averaging feature screening procedure that identifies the relationships between the possible predictors and the response variable. The proposed screening method is model free and stable, and it possesses the sure screening property under some regularity conditions. Monte Carlo simulations and a real data application are conducted to evaluate the performance of the proposed procedure.

3.
In the era of Big Data, extracting the most important explanatory variables available in ultrahigh-dimensional data plays a key role in scientific research. Existing research has mainly focused on using the extracted explanatory variables to describe the central tendency of the related response variables. For a response variable, however, variability is as important as central tendency in statistical inference. This paper focuses on variability and proposes a new model-free feature screening approach: sure explained variability and independence screening (SEVIS). The core of SEVIS is to take advantage of recently proposed asymmetric and nonlinear generalised measures of correlation in the screening. Under some mild conditions, the paper shows that SEVIS not only possesses the desired sure screening and ranking consistency properties, but is also a computationally convenient variable selection method for ultrahigh-dimensional data sets with more features than observations. The superior performance of SEVIS, compared with existing model-free methods, is illustrated in extensive simulations. A real example in ultrahigh-dimensional variable selection demonstrates that the variables selected by SEVIS better explain not only the response variables, but also the variables selected by other methods.

4.
This article is concerned with feature screening for ultrahigh dimensional discriminant analysis. A variance ratio screening method is proposed and the sure screening property of this screening procedure is proved. The proposed method has some additional desirable features. First, it is model-free: it does not require a specific discriminant model and can be directly applied to the multi-category situation. Second, it can effectively screen main effects and interaction effects simultaneously. Third, it is relatively inexpensive in computational cost because of its simple structure. The finite sample properties are examined through Monte Carlo simulation studies and two real-data analyses.

5.
Feature screening and variable selection are fundamental in the analysis of ultrahigh-dimensional data, which are being collected in diverse scientific fields at relatively low cost. Distance correlation-based sure independence screening (DC-SIS) has been proposed to perform feature screening for ultrahigh-dimensional data. DC-SIS possesses the sure screening property and filters out unimportant predictors in a model-free manner. Like all independence screening methods, however, it fails to detect truly important predictors that are marginally independent of the response variable because of correlations among predictors. When there are many irrelevant predictors that are highly correlated with some strongly active predictors, independence screening may miss other active predictors with relatively weak marginal signals. To improve the performance of DC-SIS, we introduce an effective iterative procedure based on distance correlation to detect all truly important predictors and potential interactions in both linear and nonlinear models. Thus, the proposed iterative method possesses the favourable model-free and robust properties. We further illustrate its excellent finite-sample performance through comprehensive simulation studies and an empirical analysis of the rat eye expression data set.
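To make the marginal (non-iterative) step concrete, here is a minimal sketch of screening by sample distance correlation for scalar predictors. It uses only the Python standard library. All function names and the simulated data are hypothetical, and the iterative procedure described in the abstract is not implemented.

```python
import math
import random


def centered_distances(values):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [dist[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(a, b):
    """Average of the element-wise product of two n-by-n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form) of two scalar samples."""
    a, b = centered_distances(x), centered_distances(y)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(mean_product(a, b), 0.0) / denom)


def dc_screen(columns, y, keep):
    """Rank predictors by distance correlation with y; return the top `keep`."""
    scores = [distance_correlation(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Hypothetical toy data: only predictor 3 matters, through a nonlinear effect.
rng = random.Random(0)
n, p = 100, 15
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [columns[3][i] ** 2 + 0.1 * rng.gauss(0.0, 1.0) for i in range(n)]
selected, scores = dc_screen(columns, y, keep=3)
print(selected)
```

The quadratic signal is chosen because linear correlation would largely miss it, whereas distance correlation is designed to pick up nonlinear dependence of this kind.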

6.
Most feature screening methods for ultrahigh-dimensional classification explicitly or implicitly assume that the covariates are continuous. In practice, however, it is quite common for both categorical and continuous covariates to appear in the data, and applicable feature screening methods are very limited. To handle this non-trivial situation, we propose an entropy-based feature screening method, which is model free and provides a unified screening procedure for both categorical and continuous covariates. We establish the sure screening and ranking consistency properties of the proposed procedure. We investigate the finite sample performance of the proposed procedure by simulation studies and illustrate the method with a real data analysis.

7.
Quantile regression is a flexible approach to assessing covariate effects on failure time, and it has attracted considerable interest in survival analysis. When the dimension of the covariates is much larger than the sample size, feature screening and variable selection become indispensable. In this article, we introduce a new feature screening method for ultrahigh dimensional censored quantile regression. The proposed method works for a general class of survival models, allows for heterogeneity of the data, and enjoys desirable properties including the sure screening property and the ranking consistency property. Moreover, an iterative version of the screening algorithm is proposed to accommodate more complex situations. Monte Carlo simulation studies are designed to evaluate the finite sample performance under different model settings. We also illustrate the proposed methods through an empirical analysis.

8.
It is quite a challenge to develop model-free feature screening approaches for missing response problems because the existing standard missing data analysis methods cannot be applied directly to the high-dimensional case. This paper develops some novel methods by borrowing information from missingness indicators, so that any feature screening procedure for ultrahigh-dimensional covariates with full data can be applied to the missing-response case. The first method is the so-called missing indicator imputation screening, which is developed by proving that the set of active predictors of interest for the response is a subset of the active predictors for the product of the response and the missingness indicator under some mild conditions. As an alternative, a second method, called the Venn diagram-based approach, is also developed. The sure screening property is proven for both methods. It is also shown that complete case analysis preserves the sure screening property of any feature screening approach that has this property.

9.
For ultrahigh-dimensional data, independent feature screening has been demonstrated both theoretically and empirically to be an effective dimension reduction method with low computational demand. Motivated by the Buckley–James method to accommodate censoring, we propose a fused Kolmogorov–Smirnov filter to screen out the irrelevant dependent variables for ultrahigh-dimensional survival data. The proposed model-free screening method can work with many types of covariates (e.g. continuous, discrete and categorical variables) and is shown to enjoy the sure independent screening property under mild regularity conditions without requiring any moment conditions on the covariates. In particular, the proposed procedure can still be powerful when covariates are strongly dependent on each other. We further develop an iterative algorithm to enhance the performance of our method in practical situations where some covariates may be marginally unrelated but jointly related to the response. We conduct extensive simulations to evaluate the finite-sample performance of the proposed method, showing that it performs favourably compared with existing typical methods. As an illustration, we apply the proposed method to the diffuse large-B-cell lymphoma study.
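The idea of screening with a Kolmogorov–Smirnov statistic can be sketched in a much simpler setting than the one in the abstract. The example below assumes an uncensored binary label and ranks each covariate by the two-sample Kolmogorov–Smirnov distance between its two class-conditional empirical distributions. It does not implement the Buckley–James step, the fusion across slicings, or the iterative algorithm; the function names and the toy data are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        value = min(a[i], b[j])
        # Advance past every observation equal to `value` so ties are handled.
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def ks_screen(columns, labels, keep):
    """Rank covariates by the KS distance between the label-0 and label-1 groups."""
    scores = []
    for col in columns:
        group0 = [v for v, g in zip(col, labels) if g == 0]
        group1 = [v for v, g in zip(col, labels) if g == 1]
        scores.append(ks_statistic(group0, group1))
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Hypothetical toy data: covariate 0 separates the classes, covariate 1 does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0.1, 0.4, 0.2, 0.3, 1.5, 1.7, 1.6, 1.9],
    [0.5, 1.2, 0.9, 0.3, 0.6, 1.1, 0.8, 0.4],
]
selected, scores = ks_screen(columns, labels, keep=1)
print(selected, scores)
```

Because the statistic depends only on the ordering of the observations, it needs no moment conditions on the covariates, which is consistent with the robustness claim in the abstract.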

10.
The case-cohort design has been demonstrated to be an economical and efficient approach in large cohort studies when the measurement of some covariates on all individuals is expensive. Various methods have been proposed for case-cohort data when the dimension of the covariates is smaller than the sample size. However, limited work has been done for high-dimensional case-cohort data, which are frequently collected in large epidemiological studies. In this paper, we propose a variable screening method for ultrahigh-dimensional case-cohort data under the framework of the proportional model, which allows the covariate dimension to increase with the sample size at an exponential rate. Our procedure enjoys the sure screening property and ranking consistency under some mild regularity conditions. We further extend this method to an iterative version to handle scenarios where some covariates are jointly important but marginally unrelated or only weakly correlated with the response. The finite sample performance of the proposed procedure is evaluated via both simulation studies and an application to real data from a breast cancer study.

11.
In this paper, we consider sure independence feature screening for ultrahigh dimensional discriminant analysis. We propose a new method named robust rank screening, based on the conditional expectation of the rank of the predictor’s samples. We also establish the sure screening property for the proposed procedure under simple assumptions. The new procedure has some additional desirable characteristics. First, it is robust against heavy-tailed distributions, potential outliers and sample shortage in some categories. Second, it is model-free, requiring no specification of a regression model, and is directly applicable to situations with many categories. Third, it is simple in theoretical derivation because the resulting statistics are bounded. Fourth, it is relatively inexpensive in computational cost because of the simple structure of the screening index. Monte Carlo simulations and real data examples are used to demonstrate the finite sample performance.

12.
We introduce a two-step procedure, in the context of ultra-high dimensional additive models, which aims to reduce the dimension of the covariate vector and to distinguish linear from nonlinear effects among the nonzero components. In the first step, our proposed screening procedure is constructed from the cumulative distribution function and the conditional expectation of the response in the framework of marginal correlation. B-splines and empirical distribution functions are used to estimate these two measures. The sure screening property of this procedure is also established. In the second step, a double penalization based procedure is applied to identify nonzero and linear components simultaneously. The performance of the designed method is examined on several test functions to show its capabilities against competitor methods when the distribution of the errors is varied. Simulation studies suggest that the proposed screening procedure can be applied to ultra-high dimensional data and detects the influential covariates well. It also demonstrates superiority over existing methods. The method is also applied to identify the most influential genes for overexpression of a G protein-coupled receptor in mice.

13.
We consider the problem of variable screening in ultra-high-dimensional generalized linear models (GLMs) of nonpolynomial orders. Since the popular SIS approach is extremely unstable in the presence of contamination and noise, we discuss a new robust screening procedure based on the minimum density power divergence estimator (MDPDE) of the marginal regression coefficients. Our proposed screening procedure performs well under both pure and contaminated data scenarios. We provide a theoretical motivation for the use of marginal MDPDEs for variable screening from both population and sample perspectives; in particular, we prove that the marginal MDPDEs are uniformly consistent, leading to the sure screening property of our proposed algorithm. Finally, we propose an appropriate MDPDE-based extension for robust conditional screening in GLMs along with the derivation of its sure screening property. Our proposed methods are illustrated through extensive numerical studies along with an interesting real data application.
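For context, the non-robust SIS baseline that this abstract contrasts itself with can be sketched in its simplest linear-model form, where predictors are ranked by the absolute value of their marginal Pearson correlation with the response. This is not the MDPDE-based procedure from the abstract; the function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either sample is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis(columns, y, keep):
    """Rank predictors by |marginal correlation| with y; return the top `keep`."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Hypothetical toy data.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [2.0, 1.0, 2.0, 1.0, 2.0],        # uncorrelated with y
    [-1.0, -2.0, -3.0, -4.0, -5.0],   # perfectly negatively related to y
    [1.0, 2.0, 3.0, 4.0, 50.0],       # follows y except for one extreme value
]
selected, scores = sis(columns, y, keep=1)
print(selected, scores)
```

The third column shows how a single extreme value lowers the marginal score of a predictor that otherwise tracks the response exactly. This is the kind of sensitivity to contamination that motivates robust alternatives.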

14.
In this paper, we develop a conditional model for analyzing mixed bivariate continuous and ordinal longitudinal responses. We propose a quantile regression model with random effects for the continuous responses. For this purpose, an Asymmetric Laplace Distribution (ALD) is assumed for the continuous response given the random effects. For the ordinal responses, a cumulative logit model is used, specified through a latent variable model with its own random effects. The intra-association within the continuous and within the ordinal responses is therefore taken into account through their own exclusive random effects. The inter-association between the two mixed responses, however, is taken into account by adding a continuous response term to the ordinal model. We use a Bayesian approach via Markov chain Monte Carlo to analyze the proposed conditional model, and a Gibbs sampler is used to estimate the unknown parameters. Moreover, we illustrate an application of the proposed model using part of the British Household Panel Survey data set. The results of the data analysis show that gender, age, marital status, educational level and the amount of money spent on leisure have significant effects on annual income. The association parameter is also significant in the best-fitting conditional model, so this model should be used rather than separate models.

15.
Ultra-high dimensional data arise in many fields of modern science, such as medical science, economics, genomics and image processing, and pose unprecedented challenges for statistical analysis. With the rapidly growing volume of scientific data in various disciplines, feature screening becomes a primary step to reduce the high dimensionality to a moderate scale that can be handled by existing penalized methods. In this paper, we introduce a simple and robust feature screening method without any model assumption to tackle high dimensional censored data. The proposed method is model-free and hence applicable to a general class of survival models. The sure screening and ranking consistency properties are established without any finite moment condition on the predictors or the response. The computation of the proposed method is straightforward. The finite sample performance of the proposed method is examined via extensive simulation studies. The method is illustrated with a gene association study of mantle cell lymphoma.

16.
In this article, a finite mixture model of hurdle Poisson distributions with missing outcomes is proposed, and a stochastic EM algorithm is developed for obtaining the maximum likelihood estimates of the model parameters and mixing proportions. Specifically, missing data are assumed to be missing not at random (MNAR)/non-ignorable missing (NINR), and the corresponding missingness mechanism is modeled through probit regression. To improve the efficiency of the algorithm, a stochastic step based on data augmentation is incorporated into the E-step, whereas the M-step is solved by the method of conditional maximization. A variation on the Bayesian information criterion (BIC) is also proposed to compare models with different numbers of components in the presence of missing values. The model considered is a general framework that captures the important characteristics of count data analysis, such as zero inflation/deflation, heterogeneity and missingness, providing more insight into the features of the data and allowing dispersion to be investigated more fully and correctly. Since the stochastic step only involves simulating samples from some standard distributions, the computational burden is alleviated. Once missing responses and latent variables are imputed to replace the conditional expectation, our approach works as part of a multiple imputation procedure. A simulation study and a real example illustrate the usefulness and effectiveness of our methodology.

17.
In many studies a large number of variables is measured, and the identification of relevant variables influencing an outcome is an important task. Several procedures are available for variable selection. However, focusing on one model only neglects the fact that other equally appropriate models usually exist. Bayesian or frequentist model averaging approaches have been proposed to improve the development of a predictor. With a larger number of variables (say, more than ten) the resulting class of models can be very large. For Bayesian model averaging, Occam’s window is a popular approach to reducing the model space. As this approach may not eliminate any variables, a variable screening step was proposed for a frequentist model averaging procedure. Based on the results of selected models in bootstrap samples, variables are eliminated before a model averaging predictor is derived. As a simple alternative screening procedure, backward elimination can be used. Through two examples and by means of simulation, we investigate some properties of the screening step. In the simulation study we consider situations with 15 and 25 variables, respectively, of which seven have an influence on the outcome. With the screening step, most of the uninfluential variables will be eliminated, but so will some variables with a weak effect. Variable screening leads to more applicable models without eliminating models that are more strongly supported by the data. Furthermore, we give recommendations for important parameters of the screening step.

18.
Supersaturated designs are factorial designs in which the number of potential effects is greater than the run size. They are commonly used in screening experiments, with the aim of identifying the dominant active factors at low cost. However, an important and poorly developed research field is the analysis of such designs with a non-normal response. In this article, we develop a variable selection strategy through a modification of the PageRank algorithm, which is commonly used in the Google search engine for ranking web pages. The proposed method incorporates an appropriate information-theoretical measure into this algorithm and, as a result, can be used efficiently for factor screening. A noteworthy advantage of this procedure is that it allows the use of supersaturated designs for analyzing discrete data, and therefore a generalized linear model is assumed. As shown in a thorough simulation study, in which the Type I and Type II error rates are computed for a wide range of underlying models and designs, the presented approach can be considered quite advantageous and effective.

19.
In an attempt to provide a statistical tool for disease screening and prediction, we propose a semiparametric approach to the analysis of the Cox proportional hazards cure model in situations where the observations on the event time are subject to right censoring and some covariates are missing not at random. To facilitate the methodological development, we begin with semiparametric maximum likelihood estimation (SPMLE) assuming that the (conditional) distribution of the missing covariates is known. A variant of the EM algorithm is used to compute the estimator. We then adapt the SPMLE to a more practical situation where the distribution is unknown and there is a consistent estimator based on available information. We establish the consistency and weak convergence of the resulting pseudo-SPMLE, and identify a suitable variance estimator. The application of our inference procedure to disease screening and prediction is illustrated via empirical studies. The proposed approach is used to analyze the tuberculosis screening study data that motivated this research. Its finite-sample performance is examined by simulation.

20.
Variable screening for censored survival data is most challenging when both survival and censoring times are correlated with an ultrahigh-dimensional vector of covariates. Existing approaches to handling censoring often make use of inverse probability weighting by assuming that censoring is independent of both the survival time and the covariates. This is a convenient but rather restrictive assumption which may be unmet in real applications, especially when the censoring mechanism is complex and the number of covariates is large. To accommodate heterogeneous (covariate-dependent) censoring that is often present in high-dimensional survival data, we propose a Gehan-type rank screening method to select features that are relevant to the survival time. The method is invariant to monotone transformations of the response and of the predictors, and works robustly for a general class of survival models. We establish the sure screening property of the proposed methodology. Simulation studies and a lymphoma data analysis demonstrate its favorable performance and practical utility.
