Similar literature
Found 20 similar articles (search time: 31 ms)
1.
Quantile regression is a flexible approach to assessing covariate effects on failure time, and it has attracted considerable interest in survival analysis. When the dimension of the covariates is much larger than the sample size, feature screening and variable selection become indispensable. In this article, we introduce a new feature screening method for ultrahigh-dimensional censored quantile regression. The proposed method works for a general class of survival models, allows for heterogeneous data, and enjoys desirable properties including the sure screening property and the ranking consistency property. Moreover, an iterative version of the screening algorithm is proposed to accommodate more complex situations. Monte Carlo simulation studies are designed to evaluate the finite-sample performance under different model settings. We also illustrate the proposed methods through an empirical analysis.

2.
With the recent explosion of scientific data of unprecedented size and complexity, feature ranking and screening are playing an increasingly important role in many scientific studies. In this article, we propose a novel feature screening procedure under a unified model framework, which covers a wide variety of commonly used parametric and semiparametric models. The new method does not require imposing a specific model structure on regression functions, and is thus particularly appealing for ultrahigh-dimensional regressions, where there are a huge number of candidate predictors but little information about the actual model forms. We demonstrate that, with the number of predictors growing at an exponential rate of the sample size, the proposed procedure possesses consistency in ranking, which is both useful in its own right and can lead to consistency in selection. The new procedure is computationally efficient and simple, and exhibits competent empirical performance in our intensive simulations and real data analysis.

3.
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Though there are alternatives like complete case analysis and imputation, existing methods for computing such measures cannot be applied straightforwardly when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and shows good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. Because it takes the occurrence of missing values into account, its results also differ from those obtained under multiple imputation.

4.
Ultra-high dimensional data arise in many fields of modern science, such as medical science, economics, genomics and image processing, and pose unprecedented challenges for statistical analysis. With scientific data growing so rapidly in various disciplines, feature screening becomes a primary step to reduce the high dimensionality to a moderate scale that can be handled by existing penalized methods. In this paper, we introduce a simple and robust feature screening method without any model assumption to tackle high-dimensional censored data. The proposed method is model-free and hence applicable to a general class of survival models. The sure screening and ranking consistency properties are established without any finite moment condition on the predictors or the response. The computation of the proposed method is rather straightforward. The finite-sample performance of the proposed method is examined via extensive simulation studies. The method is illustrated with a gene association study of mantle cell lymphoma.

5.
In the era of Big Data, extracting the most important exploratory variables available in ultrahigh-dimensional data plays a key role in scientific research. Existing research has mainly focused on applying the extracted exploratory variables to describe the central tendency of their related response variables. For a response variable, its variability is as important as its central tendency in statistical inference. This paper focuses on the variability and proposes a new model-free feature screening approach: sure explained variability and independence screening (SEVIS). The core of SEVIS is to take advantage of recently proposed asymmetric and nonlinear generalised measures of correlation in the screening. Under some mild conditions, the paper shows that SEVIS not only possesses the desired sure screening property and ranking consistency property, but is also a computationally convenient variable selection method for ultrahigh-dimensional data sets with more features than observations. The superior performance of SEVIS, compared with existing model-free methods, is illustrated in extensive simulations. A real example in ultrahigh-dimensional variable selection demonstrates that the variables selected by SEVIS better explain not only the response variables, but also the variables selected by other methods.

6.
In this article, a new model-free feature screening method named after probability density (mass) function distance (PDFD) correlation is presented for ultrahigh-dimensional data analysis. We improve the fused-Kolmogorov filter (F-KOL) screening procedure by using the probability density distribution. The proposed method is also fully nonparametric and can be applied to more general types of predictors and responses, including discrete and continuous random variables. Kernel density estimation and numerical integration are applied to obtain the proposed estimator. The results of simulation studies indicate that the fused-PDFD performs better than other existing screening methods, such as the F-KOL filter, sure independence screening (SIS), sure independent ranking and screening (SIRS), distance correlation sure independence screening (DCSIS) and robust rank correlation screening (RRCS). Finally, we demonstrate the validity of fused-PDFD with a real data example.

7.
Most feature screening methods for ultrahigh-dimensional classification explicitly or implicitly assume that the covariates are continuous. In practice, however, it is quite common for both categorical and continuous covariates to appear in the data, and applicable feature screening methods are very limited. To handle this non-trivial situation, we propose an entropy-based feature screening method, which is model-free and provides a unified screening procedure for both categorical and continuous covariates. We establish the sure screening and ranking consistency properties of the proposed procedure. We investigate the finite-sample performance of the proposed procedure through simulation studies and illustrate the method with a real data analysis.

8.
In this paper, we propose a conditional quantile independence screening approach for ultra-high-dimensional heterogeneous data given some known, significant and low-dimensional variables. The new method does not require imposing a specific model structure on the response and covariates, and can detect additional features that contribute to conditional quantiles of the response given the already-identified important predictors. We also prove that the proposed procedure enjoys the ranking consistency and sure screening properties. Simulation studies are carried out to examine the performance of the proposed procedure. Finally, we illustrate it with a real data example.

9.
In recent years, numerous feature screening schemes have been developed for ultra-high dimensional standard survival data with only one failure event. Nevertheless, the existing literature pays little attention to competing risks data, in which subjects are exposed to multiple mutually exclusive failures. In this article, we develop a new marginal feature screening procedure for ultra-high dimensional time-to-event data that allows for competing risks. The proposed procedure is model-free, and robust against heavy-tailed distributions and potential outliers in the time to the type of failure of interest. In addition, it is invariant to any monotone transformation of the event time of interest. Under rather mild assumptions, it is shown that the proposed approach possesses the ranking consistency and sure independence screening properties. Numerical studies are conducted to evaluate the finite-sample performance of our method and compare it with its competitor, and an application to a real data set is provided as an illustration.

10.
The Bradley–Terry model is widely and often beneficially used to rank objects from paired comparisons. The underlying assumption that makes ranking possible is the existence of a latent linear scale of merit or, equivalently, a kind of transitivity of the preference. However, in some situations, such as sensory comparisons of products, this assumption can be unrealistic. In these contexts, although the Bradley–Terry model remains of considerable interest, a linear ranking does not make sense. Our aim is to propose a 2-dimensional extension of the Bradley–Terry model that accounts for interactions between the compared objects. From a methodological point of view, this proposal can be seen as a multidimensional scaling approach in the context of a logistic model for binomial data. Maximum likelihood estimation is investigated and asymptotic properties are derived in order to construct confidence ellipses on the diagram of the 2-dimensional scores. An illustrative example based on real sensory data shows how to use the 2-dimensional model to inspect the lack of fit of the Bradley–Terry model.
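
For readers unfamiliar with the baseline model, the following is a minimal sketch of the standard one-dimensional Bradley–Terry fit that the abstract sets out to extend; it is not the 2-dimensional extension itself. It uses the classical minorization-maximization update on a win-count matrix. The function name `fit_bradley_terry`, the iteration count and the toy data are illustrative assumptions, and the sketch assumes every object has at least one win and at least one recorded comparison.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry merit scores from a win-count matrix.

    wins[i][j] is the number of times object i was preferred to object j.
    Returns scores p normalized to sum to 1; the model states that
    P(i preferred to j) = p[i] / (p[i] + p[j]).
    """
    m = len(wins)
    p = [1.0 / m] * m  # start from equal merit
    for _ in range(iterations):
        updated = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        p = [value / norm for value in updated]
    return p


# Hypothetical data: three products, ten comparisons per pair.
wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
scores = fit_bradley_terry(wins)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The single vector of scores is exactly the latent linear scale of merit the abstract refers to: any fitted ranking is transitive by construction, which is why preference patterns with cycles motivate a higher-dimensional extension.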

11.
Supersaturated designs are factorial designs in which the number of potential effects is greater than the run size. They are commonly used in screening experiments, with the aim of identifying the dominant active factors at low cost. However, the analysis of such designs with a non-normal response is an important but poorly developed research field. In this article, we develop a variable selection strategy through a modification of the PageRank algorithm, which is commonly used in the Google search engine for ranking web pages. The proposed method incorporates an appropriate information-theoretic measure into this algorithm and, as a result, can be used efficiently for factor screening. A noteworthy advantage of this procedure is that it allows the use of supersaturated designs for analyzing discrete data, for which a generalized linear model is assumed. As shown by a thorough simulation study, in which the Type I and Type II error rates are computed for a wide range of underlying models and designs, the presented approach can be considered quite advantageous and effective.

12.
In this paper, a generalized partially linear model (GPLM) with missing covariates is studied, and a Monte Carlo EM (MCEM) algorithm with the penalized-spline (P-spline) technique is developed to estimate the regression coefficients and the nonparametric function, respectively. As classical model selection procedures such as Akaike's information criterion (AIC) become invalid for the considered models with incomplete data, some new model selection criteria for GPLMs with missing covariates are proposed under two different missingness mechanisms, namely missing at random (MAR) and missing not at random (MNAR). The most attractive point of our method is that it is rather general and can be extended to various situations with missing observations based on the EM algorithm; in particular, when no missing data are involved, our new model selection criteria reduce to the classical AIC. Therefore, we can not only compare models with missing observations under MAR/MNAR settings, but also compare missing-data models with complete-data models simultaneously. Theoretical properties of the proposed estimator, including consistency of the model selection criteria, are investigated. A simulation study and a real example are used to illustrate the proposed methodology.

13.
This article is concerned with feature screening for ultrahigh-dimensional discriminant analysis. A variance ratio screening method is proposed and the sure screening property of this screening procedure is proved. The proposed method has some additional desirable features. First, it is model-free: it does not require a specific discriminant model and can be applied directly to the multi-category setting. Second, it can effectively screen main effects and interaction effects simultaneously. Third, it is relatively inexpensive in computational cost because of its simple structure. The finite-sample properties are examined through Monte Carlo simulation studies and two real-data analyses.

14.
Latin hypercube designs (LHDs) are widely used in computer experiments because of their one-dimensional uniformity and other properties. Recently, a number of methods have been proposed to construct LHDs in which all linear effects are mutually orthogonal and orthogonal to all second-order effects, i.e., quadratic effects and bilinear interactions. This paper focuses on the construction of LHDs with these desirable properties under the Fourier-polynomial model. A convenient and flexible algorithm for constructing such orthogonal LHDs is provided. Most of the resulting designs have run sizes different from those of Butler (2001), and thus are new and very suitable for factor screening and building Fourier-polynomial models in computer experiments as discussed in Butler (2001).

15.
Feature screening and variable selection are fundamental in the analysis of ultrahigh-dimensional data, which are being collected in diverse scientific fields at relatively low cost. Distance correlation-based sure independence screening (DC-SIS) has been proposed to perform feature screening for ultrahigh-dimensional data. DC-SIS possesses the sure screening property and filters out unimportant predictors in a model-free manner. Like all independence screening methods, however, it fails to detect truly important predictors that are marginally independent of the response variable because of correlations among predictors. When there are many irrelevant predictors that are highly correlated with some strongly active predictors, independence screening may miss other active predictors with relatively weak marginal signals. To improve the performance of DC-SIS, we introduce an effective iterative procedure based on distance correlation to detect all truly important predictors, and potentially interactions, in both linear and nonlinear models. The proposed iterative method thus retains the favourable model-free and robust properties. We further illustrate its excellent finite-sample performance through comprehensive simulation studies and an empirical analysis of the rat eye expression data set.
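
To make the marginal screening step concrete, here is a minimal sketch of distance-correlation screening for univariate predictors. It covers only the non-iterative ranking step, not the iterative procedure the abstract proposes. The helper names (`distance_correlation`, `dc_screen`), the simulated data and the choice of how many predictors to keep are all illustrative assumptions.

```python
import math
import random


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in d]  # equals column means by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric lists."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n**2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n**2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def dc_screen(columns, y, keep):
    """Rank predictors by distance correlation with y and keep the top ones."""
    scores = [distance_correlation(col, y) for col in columns]
    order = sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)
    return order[:keep], scores


# Simulated example: predictor 0 drives y through a nonlinear (quadratic) link,
# so its Pearson correlation with y is near zero; predictors 1-4 are pure noise.
random.seed(0)
n = 80
active = [random.gauss(0, 1) for _ in range(n)]
noise = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [value * value + 0.1 * random.gauss(0, 1) for value in active]
selected, scores = dc_screen([active] + noise, y, keep=1)
```

Because each predictor is scored on its own, a predictor that matters only jointly with others can receive a low score, which is the limitation the iterative procedure in the abstract is meant to address.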

16.
This paper concerns model selection for autoregressive time series when the observations are contaminated with a trend. We propose an adaptive least absolute shrinkage and selection operator (LASSO) type model selection method, in which the trend is estimated by B-splines, the detrended residuals are calculated, and the residuals are then used as if they were observations to optimize an adaptive LASSO type objective function. The oracle properties of this adaptive LASSO model selection procedure are established; that is, the proposed method can identify the true model with probability approaching one as the sample size increases, and the asymptotic properties of the estimators are not affected by replacing the observations with detrended residuals. Intensive simulation studies of several constrained and unconstrained autoregressive models also confirm the theoretical results. The method is illustrated with two time series data sets: annual U.S. tobacco production and annual tree ring width measurements.

17.
In this paper we consider minimisation of U-statistics with the weighted Lasso penalty and investigate their asymptotic properties in model selection and estimation. We prove that the use of appropriate weights in the penalty leads to a procedure that behaves like the oracle that knows the true model in advance, i.e. it is model selection consistent and estimates nonzero parameters at the standard rate. For the unweighted Lasso penalty, we obtain sufficient and necessary conditions for model selection consistency of the estimators. The obtained results rely strongly on the convexity of the loss function, which is the main assumption of the paper. Our theorems can be applied to the ranking problem as well as to generalised regression models. Thus, using U-statistics we can study more complex models (better describing real problems) than the usually investigated linear or generalised linear models.

18.
Feature selection (FS) is one of the most powerful techniques for coping with the curse of dimensionality. In this study, a new filter approach to feature selection based on distance correlation (DCFS, for short) is presented, which keeps the model-free advantage without any pre-specified parameters. Our method consists of two steps: a hard step (forward selection) and a soft step (backward selection). In the hard step, two types of associations, between a univariate feature and the classes and between a group feature and the classes, are used to pick out the features most relevant to the target classes. Because of the strict screening condition in the first step, some useful features are likely to be removed. Therefore, in the soft step, a feature-relationship gain (like a feature score) based on the distance correlation is introduced, which is concerned with five kinds of associations. We sort the feature gain values and run the backward selection procedure until the errors stop declining. The simulation results show that our method is competitive on several datasets compared with some representative feature selection methods based on several classification models.

19.
It is quite a challenge to develop model-free feature screening approaches for missing response problems because the existing standard missing data analysis methods cannot be applied directly to the high-dimensional case. This paper develops some novel methods by borrowing information from missingness indicators such that any feature screening procedure for ultrahigh-dimensional covariates with full data can be applied to the missing response case. The first method is the so-called missing indicator imputation screening, which is developed by proving that the set of active predictors of interest for the response is a subset of the active predictors for the product of the response and the missingness indicator under some mild conditions. As an alternative, another method called the Venn diagram-based approach is also developed. The sure screening property is proven for both methods. It is also shown that the complete case analysis preserves the sure screening property of any feature screening approach that has it.

20.
This paper is about variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging in the presence of highly correlated predictors. First, we provide a theoretical study of the permutation importance measure for an additive regression model. This allows us to describe how the correlation between predictors impacts the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context. This algorithm recursively eliminates variables using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables together with a good prediction error. Finally, this selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
