Similar Articles
A total of 20 similar articles were found.
1.
In this paper, we propose a novel Max-Relevance and Min-Common-Redundancy criterion for variable selection in linear models. Considering that the ensemble approach for variable selection has been proven to be quite effective in linear regression models, we construct a variable selection ensemble (VSE) by combining the presented stochastic correlation coefficient algorithm with a stochastic stepwise algorithm. We conduct an extensive experimental comparison of our algorithm and other methods using two simulation studies and four real-life data sets. The results confirm that the proposed VSE leads to promising improvements in variable selection and regression accuracy.
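As a rough illustration of the max-relevance/min-redundancy idea behind such a criterion (not the authors' stochastic correlation coefficient or stochastic stepwise algorithms), a minimal greedy selector might trade marginal correlation with the response against average correlation with already-selected predictors; the function name and toy data below are assumptions for illustration only.

```python
import numpy as np

def greedy_mrmr(X, y, k):
    """Greedy variable selection: at each step add the predictor with the
    highest (relevance - redundancy) score, where relevance is |corr(x_j, y)|
    and redundancy is the mean |corr(x_j, x_s)| over already selected x_s."""
    n, p = X.shape
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            if selected:
                redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                      for s in selected])
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy usage: 3 informative predictors out of 10
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(scale=0.5, size=200)
print(greedy_mrmr(X, y, k=3))
```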

2.
3.
Feature screening and variable selection are fundamental in the analysis of ultrahigh-dimensional data, which are being collected in diverse scientific fields at relatively low cost. Distance correlation-based sure independence screening (DC-SIS) has been proposed to perform feature screening for ultrahigh-dimensional data. The DC-SIS possesses the sure screening property and filters out unimportant predictors in a model-free manner. Like all independence screening methods, however, it fails to detect truly important predictors that are marginally independent of the response variable due to correlations among predictors. When there are many irrelevant predictors that are highly correlated with some strongly active predictors, independence screening may miss other active predictors with relatively weak marginal signals. To improve the performance of DC-SIS, we introduce an effective iterative procedure based on distance correlation to detect all truly important predictors and potential interactions in both linear and nonlinear models. The proposed iterative method thus retains the favourable model-free and robust properties. We further illustrate its excellent finite-sample performance through comprehensive simulation studies and an empirical analysis of the rat eye expression data set.
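A minimal sketch of the non-iterative DC-SIS idea, ranking predictors by their sample distance correlation with the response (V-statistic form with double-centered distance matrices); the iterative extension proposed in the article is not reproduced here, and the function names and toy data are assumptions.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples (V-statistic form)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                  # pairwise distances for x
    b = np.abs(y[:, None] - y[None, :])                  # pairwise distances for y
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2_xy = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2_xy / denom) if denom > 0 else 0.0

def dc_screen(X, y, d):
    """Rank predictors by distance correlation with y and keep the top d."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d], scores

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 50))
y = np.sin(X[:, 3]) + X[:, 7] ** 2 + rng.normal(scale=0.3, size=150)
kept, _ = dc_screen(X, y, d=5)
print(kept)   # columns 3 and 7 should rank near the top
```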

4.
Classification of high-dimensional data sets is a major challenge for statistical learning and data mining algorithms. To apply classification methods effectively to high-dimensional data sets, feature selection is an indispensable pre-processing step of the learning process. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets that have a small sample size and a large number of features. A novel feature selection approach, named Four-Staged Feature Selection, is proposed to overcome the high-dimensional data classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are applied to verify the efficiency of the selected features. To test the adequacy of the proposed method, 10 different microarray data sets are employed, owing to their high number of features and small sample size.
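The exact four stages are not spelled out in this abstract; the sketch below only illustrates the first stage (candidate features from several filters based on different metrics) and the union/voting ideas with scikit-learn filters on a synthetic small-sample, high-dimensional data set. All parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# High-dimensional, small-sample toy data (microarray-like shape)
X, y = make_classification(n_samples=60, n_features=500, n_informative=10,
                           random_state=0)

# Stage 1: candidate features from several filters based on different metrics
filters = [f_classif, mutual_info_classif]
candidates = [set(SelectKBest(f, k=30).fit(X, y).get_support(indices=True))
              for f in filters]

# Union stage: pool the features proposed by any filter
pooled = sorted(set.union(*candidates))
# Voting stage (stricter): keep features proposed by every filter,
# falling back to the pooled set if the intersection happens to be empty
voted = sorted(set.intersection(*candidates)) or pooled

# Verify the selected subset with a simple classifier
score = cross_val_score(LogisticRegression(max_iter=1000), X[:, voted], y, cv=5).mean()
print(len(voted), round(score, 3))
```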

5.
Based on various improved robust covariance estimators in the literature, several modified versions of the well-known correlated information criterion (CIC) for working intra-cluster correlation structure (ICS) selection are proposed. Performances of these modified criteria are examined and compared to the CIC via simulations. When the response is Gaussian, binary, or Poisson, the modified criteria are demonstrated to have higher detection rates when the true ICS is exchangeable, while the CIC would perform better when the true ICS is AR(1). An application of the criteria is made to a real dataset.

6.
7.
In this article, a new robust variable selection approach is introduced by combining the robust generalized estimating equations and adaptive LASSO penalty function for longitudinal generalized linear models. Then, an efficient weighted Gaussian pseudo-likelihood version of the BIC (WGBIC) is proposed to choose the tuning parameter in the process of robust variable selection and to select the best working correlation structure simultaneously. Meanwhile, the oracle properties of the proposed robust variable selection method are established and an efficient algorithm combining the iterative weighted least squares and minorization–maximization is proposed to implement robust variable selection and parameter estimation.

8.
Procedures are derived for selecting, with controlled probability of error, (1) a subset of populations which contains all populations better than a dual probability/proportion standard and (2) a subset of populations which both contains all populations better than an upper probability/proportion standard and also contains no populations worse than a lower probability/proportion standard. The procedures are motivated by current investigations in the area of computer performance evaluation.

9.
The minimum disparity estimators proposed by Lindsay (1994) for discrete models form an attractive subclass of minimum distance estimators which achieve their robustness without sacrificing first-order efficiency at the model. Similarly, disparity test statistics are useful robust alternatives to the likelihood ratio test for testing hypotheses in parametric models; they are asymptotically equivalent to the likelihood ratio test statistics under the null hypothesis and contiguous alternatives. Despite their asymptotic optimality properties, the small-sample performance of many of the minimum disparity estimators and disparity tests can be considerably worse than that of the maximum likelihood estimator and the likelihood ratio test, respectively. In this paper we focus on the class of blended weight Hellinger distances, a general subfamily of disparities, study the effects of combining two different distances within this class to generate the family of “combined” blended weight Hellinger distances, and identify the members of this family which generally perform well. More generally, we investigate the class of “combined and penalized” blended weight Hellinger distances; the penalty is based on reweighting the empty cells, following Harris and Basu (1994). It is shown that some members of the combined and penalized family have rather attractive properties.

10.
11.
Not all of the micro-enterprise credit information variables collected by credit reporting agencies are suitable for assessing the creditworthiness of micro enterprises. This article designs a BP neural network to perform feature selection for this task. The network is trained with a forward sequential feature selection algorithm that uses the sensitivity of the output-layer output to the input values as the selection criterion, and the network outputs the feature variables corresponding to the lowest sensitivity. A probabilistic neural network is then designed to run a simulation analysis on the selected features; the profit obtained by the lending institution is two-thirds higher than with feature selection based on contingency-table analysis.
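A hedged sketch of the general idea of sensitivity-based feature ranking with a neural network (not the article's forward sequential BP training or its probabilistic-network simulation): perturb each standardized input and measure the change in the fitted network's output. Names and toy data are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def output_sensitivity(model, X, eps=1e-2):
    """Mean absolute change of the predicted probability when each
    (standardized) input is perturbed by eps: a simple sensitivity score."""
    base = model.predict_proba(X)[:, 1]
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += eps
        scores.append(np.mean(np.abs(model.predict_proba(Xp)[:, 1] - base)) / eps)
    return np.array(scores)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X_std, y)
print(np.argsort(output_sensitivity(net, X_std))[::-1])   # most sensitive inputs first
```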

12.
Agreement measures are designed to assess consistency between different instruments rating measurements of interest. When the individual responses are correlated with multilevel structure of nestings and clusters, traditional approaches are not readily available to estimate the inter- and intra-agreement for such complex multilevel settings. Our research stems from conformity evaluation between optometric devices with measurements on both eyes, equality tests of agreement in high myopic status between monozygous twins and dizygous twins, and assessment of reliability for different pathologists in dysplasia. In this paper, we focus on applying a Bayesian hierarchical correlation model incorporating adjustment for explanatory variables and nesting correlation structures to assess the inter- and intra-agreement through correlations of random effects for various sources. This Bayesian generalized linear mixed-effects model (GLMM) is further compared with the approximate intra-class correlation coefficients and kappa measures by the traditional Cohen’s kappa statistic and the generalized estimating equations (GEE) approach. The results of comparison studies reveal that the Bayesian GLMM provides a reliable and stable procedure in estimating inter- and intra-agreement simultaneously after adjusting for covariates and correlation structures, in marked contrast to Cohen’s kappa and the GEE approach.

13.
The Dirichlet process is a fundamental tool in studying Bayesian nonparametric inference. The Dirichlet process has several sum representations, where each one of these representations highlights some aspects of this important process. In this paper, we use the sum representations of the Dirichlet process to derive explicit expressions that are used to calculate Kolmogorov, Lévy, and Cramér–von Mises distances between the Dirichlet process and its base measure. The derived expressions of the distance are used to select a proper value for the concentration parameter of the Dirichlet process. These tools are also used in a goodness-of-fit test. Illustrative examples and simulation results are included.
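As an illustration of one sum representation, the sketch below draws a truncated stick-breaking (Sethuraman) realization of a Dirichlet process and evaluates its Kolmogorov distance to the base measure numerically; the closed-form expressions derived in the paper are not reproduced, and the truncation level and helper names are assumptions.

```python
import numpy as np
from scipy import stats

def dp_stick_breaking(alpha, base_rvs, n_atoms=2000, rng=None):
    """Truncated stick-breaking (Sethuraman) representation of a Dirichlet
    process DP(alpha, H): returns atom locations and weights."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=n_atoms)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    atoms = base_rvs(size=n_atoms, random_state=rng)
    return atoms, weights

def kolmogorov_distance(atoms, weights, base_cdf):
    """sup_x |F_DP(x) - H(x)| for a discrete (truncated) DP draw vs a continuous H."""
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order]
    F = np.cumsum(weights)
    H = base_cdf(atoms)
    # step function vs continuous CDF: check both sides of every jump
    return max(np.max(np.abs(F - H)), np.max(np.abs(F - weights - H)))

# One draw from DP(alpha, N(0,1)) and its Kolmogorov distance to the base measure
alpha = 5.0
atoms, w = dp_stick_breaking(alpha, stats.norm.rvs, rng=0)
print(kolmogorov_distance(atoms, w, stats.norm.cdf))
```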

14.
Energy statistics: A class of statistics based on distances
Energy distance is a statistical distance between the distributions of random vectors, which characterizes equality of distributions. The name energy derives from Newton's gravitational potential energy, and there is an elegant relation to the notion of potential energy between statistical observations. Energy statistics are functions of distances between statistical observations in metric spaces. Thus even if the observations are complex objects, like functions, one can use their real valued nonnegative distances for inference. Theory and application of energy statistics are discussed and illustrated. Finally, we explore the notion of potential and kinetic energy of goodness-of-fit.
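A minimal sketch of the sample energy distance 2E|X-Y| - E|X-X'| - E|Y-Y'| between two samples of vectors, estimated by averaging pairwise Euclidean distances; the function name and toy data are illustrative.

```python
import numpy as np

def energy_distance(x, y):
    """Sample energy distance between two samples of vectors:
    2*E|X-Y| - E|X-X'| - E|Y-Y'| estimated by pairwise Euclidean distances."""
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    def mean_pdist(a, b):
        return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))
    return 2.0 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(3)
same = energy_distance(rng.normal(size=(300, 2)), rng.normal(size=(300, 2)))
diff = energy_distance(rng.normal(size=(300, 2)), rng.normal(loc=1.0, size=(300, 2)))
print(round(same, 3), round(diff, 3))   # distance is near 0 for equal distributions
```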

15.
The problem of selecting the correct subset of predictors within a linear model has received much attention in the recent literature. Within the Bayesian framework, a popular choice of prior has been Zellner's g-prior, which is based on the inverse of the empirical covariance matrix of the predictors. An extension of Zellner's prior is proposed in this article which allows for a power parameter on the empirical covariance of the predictors. The power parameter helps control the degree to which correlated predictors are smoothed towards or away from one another. In addition, the empirical covariance of the predictors is used to obtain suitable priors over model space. In this manner, the power parameter also helps to determine whether models containing highly collinear predictors are preferred or avoided. The proposed power parameter can be chosen via an empirical Bayes method, which leads to a data-adaptive choice of prior. Simulation studies and a real data example are presented to show how the power parameter is well determined by the degree of cross-correlation within the predictors. The proposed modification compares favorably to the standard use of Zellner's prior and an intrinsic prior in these examples.
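For reference, a minimal sketch in LaTeX of the standard Zellner g-prior together with one plausible reading of the power-parameter extension described above; the authors' exact formulation is not given in this abstract and may differ.

```latex
% Standard Zellner g-prior on the coefficients of model gamma, and a
% modification with a power parameter lambda on the empirical covariance
% X'X (lambda = 1 recovers the standard prior); the article's exact form
% may differ from this sketch.
\[
\beta_\gamma \mid \sigma^2, g \sim
\mathcal{N}\!\left(0,\; g\,\sigma^2 (X_\gamma^\top X_\gamma)^{-1}\right)
\qquad \text{(Zellner's } g\text{-prior)}
\]
\[
\beta_\gamma \mid \sigma^2, g, \lambda \sim
\mathcal{N}\!\left(0,\; g\,\sigma^2 \big[(X_\gamma^\top X_\gamma)^{\lambda}\big]^{-1}\right)
\qquad \text{(power-parameter extension, } \lambda \text{ tunable)}
\]
```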

16.
Recently, ensemble learning approaches have been proven to be quite effective for variable selection in linear regression models. In general, a good variable selection ensemble should consist of a diverse collection of strong members. Based on the parallel genetic algorithm (PGA) proposed in [M. Zhu and H.A. Chipman, Darwinian evolution in parallel universes: A parallel genetic algorithm for variable selection, Technometrics 48(4) (2006), pp. 491–502, doi: 10.1198/004017006000000093], in this paper we propose a novel method, RandGA, which injects randomness into the PGA with the aim of increasing the diversity among ensemble members. Using a number of simulated data sets, we show that the newly proposed method RandGA compares favorably with other variable selection techniques. As a real example, the new method is applied to the diabetes data.
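A plain genetic algorithm over 0/1 inclusion masks gives the flavour of GA-based variable selection; it is not the parallel PGA or the specific randomness injection of RandGA, and the fitness function, operators and toy data are assumptions.

```python
import numpy as np

def fitness(X, y, mask):
    """Negative AIC of the least-squares fit on the selected columns (higher is better)."""
    if mask.sum() == 0:
        return -np.inf
    Xs = X[:, mask]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    n = len(y)
    return -(n * np.log(rss / n) + 2 * mask.sum())

def simple_ga(X, y, pop_size=40, generations=30, mut_rate=0.1, rng=None):
    """A very plain GA: tournament selection, uniform crossover, bit-flip mutation."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5
    for _ in range(generations):
        fit = np.array([fitness(X, y, ind) for ind in pop])
        new_pop = []
        for _ in range(pop_size):
            i, j = rng.integers(pop_size, size=2)        # tournament of size 2
            a = pop[i] if fit[i] >= fit[j] else pop[j]
            i, j = rng.integers(pop_size, size=2)
            b = pop[i] if fit[i] >= fit[j] else pop[j]
            child = np.where(rng.random(p) < 0.5, a, b)  # uniform crossover
            child ^= rng.random(p) < mut_rate            # random bit-flip mutation
            new_pop.append(child)
        pop = np.array(new_pop)
    fit = np.array([fitness(X, y, ind) for ind in pop])
    return pop[np.argmax(fit)]

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 12))
y = X[:, 1] - 2 * X[:, 4] + rng.normal(size=150)
print(np.flatnonzero(simple_ga(X, y, rng=0)))   # should recover columns 1 and 4
```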

17.
Estimators are proposed for the parameter vector θ ∈ R^p of a linear regression model of the type Y = Xθ + ε, where X is the design matrix, Y the vector of the response variable and ε the random error vector that follows an AR(1) correlation structure. These estimators are asymptotically analyzed by proving their strong consistency, asymptotic normality and asymptotic efficiency. In a simulation study, the mean squared error of the proposed estimators is observed to behave better than that of the generalized least squares estimators.
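A minimal sketch of generalized least squares under an AR(1) error correlation structure, the baseline the proposed estimators are compared against; the article's own estimators are not reproduced here, and the function names and toy data are assumptions.

```python
import numpy as np

def ar1_covariance(n, rho):
    """AR(1) correlation matrix: Sigma[i, j] = rho**|i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def gls(X, y, rho):
    """Generalized least squares estimate of theta in Y = X theta + eps,
    where eps follows an AR(1) correlation structure with parameter rho."""
    Sigma_inv = np.linalg.inv(ar1_covariance(len(y), rho))
    A = X.T @ Sigma_inv @ X
    return np.linalg.solve(A, X.T @ Sigma_inv @ y)

# Toy comparison with ordinary least squares under AR(1) errors
rng = np.random.default_rng(4)
n, rho = 200, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = np.zeros(n)
for t in range(1, n):                      # simulate AR(1) noise
    eps[t] = rho * eps[t - 1] + rng.normal(scale=0.5)
y = X @ np.array([1.0, 2.0]) + eps
print(gls(X, y, rho), np.linalg.lstsq(X, y, rcond=None)[0])
```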

18.
In this paper, we investigate the effect of pre-smoothing on model selection. Christóbal et al. [Christóbal, J.A., Faraldo Roca, P. and González Manteiga, W. (1987). A class of linear regression parameter estimators constructed by nonparametric estimation. Ann. Statist. 15: 603–609] showed the beneficial effect of pre-smoothing on estimating the parameters in a linear regression model. Here, in a regression setting, we show that smoothing the response data prior to model selection by Akaike's information criterion can lead to an improved selection procedure. The bootstrap is used to control the magnitude of the random error structure in the smoothed data. The effect of pre-smoothing on model selection is shown in simulations. The method is illustrated in a variety of settings, including the selection of the best fractional polynomial in a generalized linear model.
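A toy sketch of the pre-smoothing idea (without the bootstrap calibration of the error magnitude described above): kernel-smooth the response first, then choose a polynomial degree by AIC on the raw and on the pre-smoothed data. Function names, bandwidth and toy data are assumptions.

```python
import numpy as np

def nw_smooth(x, y, h):
    """Nadaraya-Watson kernel pre-smoothing of the response."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def aic_poly(x, y, degree):
    """Gaussian AIC (up to a constant) of a polynomial regression of the given degree."""
    X = np.vander(x, degree + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, k = len(y), degree + 1
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(-2, 2, size=120))
y = x ** 3 - x + rng.normal(scale=1.0, size=120)

y_smooth = nw_smooth(x, y, h=0.2)          # pre-smoothing step
for data, label in [(y, "raw"), (y_smooth, "pre-smoothed")]:
    best = min(range(1, 8), key=lambda d: aic_poly(x, data, d))
    print(label, "-> degree", best)
```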

19.
Consider the problem of several mathematical models competing to explain an empirical phenomenon. It is argued that fundamental model selection questions of model discrimination, model correctness and model adequacy can be answered by measuring the discrepancy of each model from an estimate of the unknown data-generating process. Formally, the framework adopted is that of Linhart and Zucchini (1986), who considered a very general definition of discrepancy. In the present paper, it is demonstrated that requiring a discrepancy to possess several very reasonable properties is in fact equivalent to requiring it to be a metric on the relevant space of probability distributions.

20.
In this article, we consider the problem of variable selection in linear regression when multicollinearity is present in the data. It is well known that in the presence of multicollinearity, the performance of the least squares (LS) estimator of the regression parameters is not satisfactory. Consequently, subset selection methods such as Mallows' Cp, which are based on LS estimates, lead to the selection of inadequate subsets. To overcome the problem of multicollinearity in subset selection, a new subset selection algorithm based on the ridge estimator is proposed. It is shown that the new algorithm is a better alternative to Mallows' Cp when the data exhibit multicollinearity.
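A rough sketch contrasting Mallows' Cp computed from least-squares fits with an analogous statistic computed from ridge fits on collinear toy data; this is only an illustrative stand-in, not the article's ridge-based subset selection algorithm, and the penalty value and toy data are assumptions.

```python
import numpy as np
from itertools import combinations

def cp_ls(X, y, subset, sigma2):
    """Mallows' Cp for the least-squares fit on the given subset of columns."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    n, p = len(y), len(subset)
    return rss / sigma2 - n + 2 * p

def cp_ridge(X, y, subset, sigma2, lam=1.0):
    """The same statistic with a ridge fit; an illustrative stand-in for a
    ridge-based subset criterion (not the article's exact algorithm)."""
    Xs = X[:, subset]
    p = Xs.shape[1]
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ y)
    rss = np.sum((y - Xs @ beta) ** 2)
    return rss / sigma2 - len(y) + 2 * p

# Collinear toy data: the second column is nearly a copy of the first
rng = np.random.default_rng(6)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=n), rng.normal(size=n)])
y = 2 * x1 + X[:, 2] + rng.normal(size=n)

full_beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.sum((y - X @ full_beta) ** 2) / (n - X.shape[1])
subsets = [s for r in (1, 2, 3) for s in combinations(range(3), r)]
print(min(subsets, key=lambda s: cp_ls(X, y, list(s), sigma2)),
      min(subsets, key=lambda s: cp_ridge(X, y, list(s), sigma2)))
```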
