Similar Documents
20 similar documents found
1.
We consider statistical procedures for feature selection defined by a family of regularization problems with convex piecewise linear loss functions and l1-type penalties. Many known statistical procedures (e.g. quantile regression and support vector machines with an l1-norm penalty) are subsumed under this category. Computationally, the regularization problems are linear programming (LP) problems indexed by a single parameter, known as 'parametric cost LP' or 'parametric right-hand-side LP' in optimization theory. Exploiting the connection with LP theory, we lay out general algorithms, namely the simplex algorithm and its variant, for generating regularized solution paths for the feature selection problems. The significance of such algorithms is that they allow a complete exploration of the model space along the paths and provide a broad view of persistent features in the data. The implications of the general path-finding algorithms are outlined for several statistical procedures, and they are illustrated with numerical examples.
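The LP connection can be made concrete. The sketch below is not the parametric path algorithm itself; it solves l1-penalized quantile regression at one fixed penalty value by casting it as an LP with scipy.optimize.linprog. All data and names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def l1_quantile_regression(X, y, tau=0.5, lam=1.0):
    """Solve min_b sum_i rho_tau(y_i - x_i'b) + lam*||b||_1 as an LP.

    Decision variables: [b+, b-, u+, u-] (all >= 0), with
    b = b+ - b- and y - Xb = u+ - u-.
    """
    n, p = X.shape
    # Objective: lam*(b+ + b-) + tau*u+ + (1 - tau)*u-
    c = np.concatenate([lam * np.ones(2 * p),
                        tau * np.ones(n), (1 - tau) * np.ones(n)])
    # Equality constraints: X(b+ - b-) + u+ - u- = y
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.standard_normal(100)
print(l1_quantile_regression(X, y, tau=0.5, lam=5.0).round(2))
```

Re-solving this LP over a grid of lambda values approximates the solution path that the parametric simplex approach traces exactly.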

2.
Feature selection (FS) is one of the most powerful techniques for coping with the curse of dimensionality. In this study, a new filter approach to feature selection based on distance correlation is presented (DCFS for short), which retains the model-free advantage and requires no pre-specified parameters. Our method consists of two steps: a hard step (forward selection) and a soft step (backward selection). In the hard step, two types of association, between individual features and the classes and between feature groups and the classes, are used to pick out the features most relevant to the target classes. Because of the strict screening condition in this step, some useful features are likely to be removed. Therefore, in the soft step, a feature-relationship gain (a feature score) based on distance correlation is introduced, which involves five kinds of associations. We sort the feature gain values and run the backward selection procedure until the error stops declining. Simulation results show that our method is competitive on several datasets compared with representative feature selection methods under several classification models.
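For reference, a minimal NumPy implementation of the (biased) sample distance correlation between a single feature and the class labels is sketched below; a filter can rank features by this score. Names and data are illustrative.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between 1-D arrays x and y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])   # pairwise distances within y
    # Double-center each distance matrix.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = max((A * B).mean(), 0.0)      # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = (x + 0.5 * rng.standard_normal(200) > 0).astype(float)   # labels tied to x
print(distance_correlation(x, y))                            # clearly > 0
print(distance_correlation(rng.standard_normal(200), y))     # noise: near 0
```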

3.
In many conventional scientific investigations with high- or ultra-high-dimensional feature spaces, the relevant features, though sparse, are large in number compared with classical statistical problems, and the magnitudes of their effects taper off. It is reasonable to model the number of relevant features as a diverging sequence as the sample size increases. In this paper, we investigate the properties of the extended Bayes information criterion (EBIC) (Chen and Chen, 2008) for feature selection in linear regression models with a diverging number of relevant features in high- or ultra-high-dimensional feature spaces. The selection consistency of the EBIC in this situation is established. The application of EBIC to feature selection is considered in a SCAD cum EBIC procedure. Simulation studies are conducted to demonstrate the performance of the SCAD cum EBIC procedure in finite-sample cases.
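For a Gaussian linear submodel with k selected variables out of p candidates, the EBIC takes the form n*log(RSS/n) + k*log(n) + 2*gamma*log(C(p, k)). A minimal sketch (the RSS values and settings below are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def ebic(rss, n, k, p, gamma=0.5):
    """Extended BIC (Chen and Chen, 2008) for a Gaussian linear model."""
    # log C(p, k) computed stably via log-gamma functions
    log_binom = gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * gamma * log_binom

# Compare a sparse 3-variable fit with a larger 10-variable fit (toy numbers):
print(ebic(rss=95.0, n=200, k=3, p=5000))
print(ebic(rss=90.0, n=200, k=10, p=5000))
```

The extra 2*gamma*log(C(p, k)) term is what penalizes the size of the model space when p is ultra-high relative to n.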

4.
A new variational Bayesian (VB) algorithm, split and eliminate VB (SEVB), for modeling data via a Gaussian mixture model (GMM) is developed. The new algorithm uses component splitting in a way that is better suited than existing VB-based approaches to analyzing large numbers of highly heterogeneous, spiky spatial patterns with weak prior information. SEVB is a highly computationally efficient approach to Bayesian inference and, like any VB-based algorithm, it can perform model selection and parameter estimation simultaneously. A significant feature of our algorithm is that the fitted number of components is not limited by the initial proposal, giving increased modeling flexibility. We introduce two types of split operation and propose a new goodness-of-fit measure for evaluating mixture models, and we evaluate their usefulness through empirical studies. In addition, we illustrate the utility of the new approach in an application to modeling human mobility patterns. This application involves large volumes of highly heterogeneous, spiky data that are difficult to model well with the standard VB approach, which is too restrictive and lacks the required flexibility. Empirical results suggest that our algorithm improves upon the goodness-of-fit achieved by the standard VB method and is more robust to various initialization settings.
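SEVB itself is not in standard libraries, but the standard VB baseline it improves upon is. A minimal sketch with scikit-learn's BayesianGaussianMixture on illustrative spiky data:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
# Spiky heterogeneous 2-D data: three tight clusters of very unequal size.
X = np.vstack([rng.normal([0, 0], 0.1, (300, 2)),
               rng.normal([3, 3], 0.1, (50, 2)),
               rng.normal([0, 4], 0.1, (20, 2))])

# Standard VB fits parameters and effectively selects the number of
# components by shrinking the weights of unneeded components toward zero.
vb = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                             max_iter=500, random_state=0).fit(X)
print(np.round(vb.weights_, 3))   # most of the 10 weights collapse near 0
```

Note the contrast with the abstract's point: here the fitted number of components can never exceed the initial n_components, whereas SEVB's split operations remove that limit.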

5.
In this paper, we model the degradation of two performance characteristics of a product by stochastic processes joined through copula functions, with a different stochastic process governing the degradation of each performance characteristic (PC). Different heterogeneous and homogeneous models are presented, combining copula functions with the stochastic processes most commonly used in degradation analysis as marginal distributions. This is important because the degradation of each PC may differ in nature. As the joint distributions of the proposed models are complex, the parameters of interest are estimated via MCMC. A simulation study is performed to compare heterogeneous and homogeneous models. In addition, the proposed models are applied to crack-propagation data from two terminals of an electronic device, and some insights are provided about product reliability under heterogeneous models.
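To make the heterogeneous-marginals idea concrete, the sketch below simulates degradation increments for two PCs, one with gamma-process increments and one with Wiener (normal) increments, joined by a Gaussian copula. All parameter values are illustrative, and this is only one of the copula/marginal combinations the paper considers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, rho = 1000, 0.6

# Gaussian copula: correlated normals -> uniforms -> heterogeneous marginals.
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
u = stats.norm.cdf(z)
inc1 = stats.gamma.ppf(u[:, 0], a=2.0, scale=0.5)   # PC1: gamma-process increments
inc2 = stats.norm.ppf(u[:, 1], loc=1.0, scale=0.3)  # PC2: Wiener-process increments

path1, path2 = inc1.cumsum(), inc2.cumsum()          # degradation paths over time
print(np.corrcoef(inc1, inc2)[0, 1])                 # dependence induced by the copula
```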

6.
In this paper, we focus on feature extraction and variable selection for massive data that are partitioned and stored on different linked computers. Specifically, we study distributed model selection with the smoothly clipped absolute deviation (SCAD) penalty. Based on the alternating direction method of multipliers (ADMM) algorithm, we propose a distributed SCAD algorithm and prove its convergence. The variable-selection results of the distributed approach coincide with those of the non-distributed approach. Numerical studies show that our method is both effective and efficient, performing well in distributed data analysis.
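A key ingredient of such an ADMM scheme is the elementwise update for the SCAD-penalized variable, which under a unit quadratic weight reduces to the SCAD thresholding rule of Fan and Li (2001). A sketch (illustrative values; in a full ADMM iteration the effective penalty level depends on the ADMM step-size parameter):

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule of Fan and Li (2001), applied elementwise."""
    z = np.asarray(z, float)
    out = np.empty_like(z)
    absz, sgn = np.abs(z), np.sign(z)
    small = absz <= 2 * lam
    mid = (absz > 2 * lam) & (absz <= a * lam)
    big = absz > a * lam
    out[small] = sgn[small] * np.maximum(absz[small] - lam, 0)  # soft threshold
    out[mid] = ((a - 1) * z[mid] - sgn[mid] * a * lam) / (a - 2)  # linear blend
    out[big] = z[big]                                            # no shrinkage
    return out

print(scad_threshold(np.array([-4.0, -1.2, 0.3, 2.5, 6.0]), lam=1.0))
```

Unlike the lasso's soft threshold, large coefficients are left unshrunk, which is what delivers the (near-)unbiasedness behind the oracle property.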

7.
Classification of high-dimensional data sets is a major challenge for statistical learning and data mining algorithms. To apply classification methods effectively to high-dimensional data sets, feature selection is an indispensable pre-processing step. In this study, we consider the problem of constructing an effective feature selection and classification scheme for data sets with a small sample size and a large number of features. A novel feature selection approach, named Four-Staged Feature Selection, is proposed to overcome the high-dimensional classification problem by selecting informative features. The proposed method first selects candidate features with a number of filtering methods based on different metrics, and then applies semi-wrapper, union, and voting stages, respectively, to obtain the final feature subsets. Several statistical learning and data mining methods are carried out to verify the efficiency of the selected features. To test the adequacy of the proposed method, 10 different microarray data sets are employed because of their high number of features and small sample sizes.
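The filter-then-combine idea can be sketched with scikit-learn: run several filters based on different metrics, then keep features by union or by vote. This is a simplification of the proposed four-stage scheme (it omits the semi-wrapper stage), with illustrative data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Small sample size, many features, as in the microarray setting.
X, y = make_classification(n_samples=60, n_features=500, n_informative=8,
                           random_state=0)

# Candidate feature sets from two filters using different metrics.
masks = [SelectKBest(score, k=20).fit(X, y).get_support()
         for score in (f_classif, mutual_info_classif)]

union = np.logical_or.reduce(masks)    # union stage: kept by any filter
vote = np.sum(masks, axis=0) >= 2      # voting stage: kept by both filters
print(union.sum(), vote.sum())
```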

8.
This is a comparative study of various clustering and classification algorithms applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of feature selection tool is the collection of marginal p-values obtained from t-tests of the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of the choice of cutoff, in terms of overall Type I error rate control, on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by Breiman's random forest algorithm. Using a data set of proteomic analyses of serum from ovarian cancer patients and from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm, feature selection tool, and cutoff criterion on performance, as measured by an appropriate error rate.
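The two feature-selection tools compared can be sketched on synthetic "spectra"; all names, cutoffs, and settings below are illustrative, not those of the study:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n, p = 100, 2000                    # samples x m/z features
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)           # cancer vs non-cancer labels
X[y == 1, :10] += 1.0               # 10 truly differential m/z values

# Tool 1: marginal t-tests, keep features below a p-value cutoff.
_, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)
selected_t = np.where(pvals < 0.001)[0]

# Tool 2: random-forest importance ranking (Breiman).
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
selected_rf = np.argsort(rf.feature_importances_)[::-1][:10]
print(sorted(selected_t), sorted(selected_rf))
```

Varying the 0.001 cutoff is exactly the knob whose effect on downstream clustering/classification error the study examines.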

9.
Purposive sampling is described as a random selection of sampling units within the segment of the population with the most information on the characteristic of interest. Nonparametric bootstrap is proposed in estimating location parameters and the corresponding variances. An estimate of bias and a measure of variance of the point estimate are computed using the Monte Carlo method. The bootstrap estimator of the population mean is efficient and consistent in the homogeneous, heterogeneous, and two-segment populations simulated. The design-unbiased approximation of the standard error estimate differs substantially from the bootstrap estimate in severely heterogeneous and positively skewed populations.
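The bootstrap bias and variance estimators described take only a few lines; the sample and settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.lognormal(mean=1.0, sigma=0.8, size=50)   # a skewed purposive sample

B = 2000
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(B)])              # Monte Carlo resampling

estimate = sample.mean()
bias = boot_means.mean() - estimate     # bootstrap bias estimate
se = boot_means.std(ddof=1)             # bootstrap standard error
print(estimate, bias, se)
```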

10.
Mixture distribution models are more useful than single distributions for modeling heterogeneous data sets. The aim of this paper is to propose, for the first time, a mixture of Weibull–Poisson (WP) distributions to model heterogeneous data sets, creating a powerful alternative mixture distribution for this purpose. In the study, many features of the proposed mixture of WP distributions are examined. The expectation-maximization (EM) algorithm is used to obtain maximum-likelihood estimates of the parameters, and a simulation study is conducted to evaluate the performance of the proposed EM scheme. Applications to two real heterogeneous data sets are given to show the flexibility and potential of the new mixture distribution.

11.
In multi-stage sampling with the first-stage units (fsu) chosen without replacement (WOR) under varying probability schemes (VPS), unbiased estimators (UE) of the variances of homogeneous linear (HL) functions of unbiased estimators Ti of fsu totals Yi, based on the selection of subsequent-stage units (SSU) from the chosen fsus, are derived as homogeneous quadratic (HQ) functions of alternative, less efficient UEs, say Ti′, of the Yi. Specific strategies are illustrated.

12.
In the framework of redundancy analysis and reduced-rank regression, the extended redundancy analysis model accounts for more than two blocks of manifest variables in its specification. A further extension, generalized redundancy analysis (GRA), has recently been proposed in the literature with the aim of incorporating external covariates into the model, via a new estimation algorithm that separates the contributions of the exogenous and external covariates in the formation of the latent composites. At present, software to estimate GRA models is not available. In this paper, we provide a SAS macro, %GRA, to specify and fit structural relationships, with an application illustrating the use of the macro.

13.
We generalize the Gaussian mixture transition distribution (GMTD) model introduced by Le and co-workers to the mixture autoregressive (MAR) model for modelling non-linear time series. The models consist of a mixture of K stationary or non-stationary AR components. The advantages of the MAR model over the GMTD model include a fuller range of shape-changing predictive distributions and the ability to handle cycles and conditional heteroscedasticity in the time series. The stationarity conditions and autocorrelation function are derived. Estimation is easily done via a simple EM algorithm, and the model selection problem is addressed. The shape-changing feature of the conditional distributions makes these models capable of modelling time series with multimodal conditional distributions and with heteroscedasticity. The models are applied to two real data sets and compared with other competing models. The MAR models appear to capture features of the data better than the competing models do.

14.
This paper is about variable selection with the random forests algorithm in the presence of correlated predictors. In high-dimensional regression or classification frameworks, variable selection is a difficult task that becomes even more challenging in the presence of highly correlated predictors. First, we provide a theoretical study of the permutation importance measure for an additive regression model, which allows us to describe how correlation between predictors affects the permutation importance. Our results motivate the use of the recursive feature elimination (RFE) algorithm for variable selection in this context; this algorithm recursively eliminates variables using the permutation importance measure as a ranking criterion. Next, various simulation experiments illustrate the efficiency of the RFE algorithm for selecting a small number of variables together with a good prediction error. Finally, the selection algorithm is tested on the Landsat Satellite data from the UCI Machine Learning Repository.
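A compact version of the RFE loop with permutation importance as the ranking criterion can be written with scikit-learn; this is a sketch on illustrative data, not the authors' exact protocol (which re-evaluates importances out-of-sample):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       random_state=0)
active = list(range(X.shape[1]))

while len(active) > 5:
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X[:, active], y)
    imp = permutation_importance(rf, X[:, active], y,
                                 n_repeats=10, random_state=0).importances_mean
    active.pop(int(np.argmin(imp)))   # drop the least important variable
print(sorted(active))
```

Re-ranking after every elimination is the point: once a variable's correlated partner is removed, its permutation importance recovers, which a one-shot ranking would miss.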

15.
We propose Twin Boosting, which has much better feature selection behavior than boosting, particularly with respect to reducing the number of false positives (falsely selected features). In addition, in cases with a few important effective features and many noise features, Twin Boosting also substantially improves the predictive accuracy of boosting. Twin Boosting is as general and generic as (gradient-based) boosting: it can be used with general weak learners and in a wide variety of situations, including generalized regression, classification, and survival modeling. Furthermore, it is computationally feasible for large problems with potentially many more features than observed samples. Finally, for the special case of orthonormal linear models, we prove the equivalence of Twin Boosting to the adaptive Lasso, which provides some theoretical perspective on feature selection with Twin Boosting.

16.
Not all of the micro-enterprise credit-information variables collected by credit bureaus are suitable for assessing micro-enterprise creditworthiness. This paper designs a BP neural network for this feature selection task. The network is trained with a forward sequential feature selection algorithm, using the sensitivity of the output-layer output to each input value as the selection criterion, and the network identifies the feature variable with the minimum sensitivity. A probabilistic neural network is designed to analyze the resulting features by simulation; the profit obtained by the lending institution is two-thirds higher than under feature selection based on contingency-table analysis.
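The sensitivity criterion can be sketched with scikit-learn's MLP as a stand-in for the BP network, approximating the output's sensitivity to each input by central finite differences. All settings and data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

# Sensitivity of the predicted probability to each input, via central
# finite differences averaged over the training samples.
eps = 1e-3
sens = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    Xp, Xm = X.copy(), X.copy()
    Xp[:, j] += eps
    Xm[:, j] -= eps
    diff = mlp.predict_proba(Xp)[:, 1] - mlp.predict_proba(Xm)[:, 1]
    sens[j] = np.mean(np.abs(diff)) / (2 * eps)

print(np.argmin(sens))   # candidate feature to eliminate (lowest sensitivity)
```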

17.
Feature selection often constitutes one of the central aspects of many scientific investigations. Among the methodologies for feature selection in penalized regression, the smoothly clipped absolute deviation (SCAD) penalty is particularly useful because it satisfies the oracle property. However, its estimation algorithms, such as the local quadratic approximation and the concave–convex procedure, are not computationally efficient. In this paper, we propose an efficient penalization path algorithm. Through numerical examples on real and simulated data, we illustrate that our path algorithm can be useful for feature selection in regression problems.

18.
Kernel principal component analysis (KPCA) is a highly effective dimension reduction method proposed in recent years, but it does not guarantee that the extracted leading principal components are the most suitable for classifying the reduced data. Rough set (RS) theory is an effective way to handle this problem. We propose a support vector classifier (SVC) based on KPCA and RS theory: RS theory and the information entropy principle are used to perform feature selection on training samples whose features have been extracted by KPCA, retaining the important features so as to reduce the size of the problem and improve the performance of the SVC. In a numerical experiment building a financial distress early-warning model for companies listed in 2006, the SVC with KPCA and RS theory as its front-end system achieved good results.
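The KPCA-then-SVC part of the pipeline can be sketched with scikit-learn; the rough-set feature-selection stage has no standard library implementation and would sit between the two steps. Data and settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)

# KPCA extracts nonlinear features; the SVC classifies in the reduced space.
# An RS-based pruning step would go between the two stages.
clf = make_pipeline(StandardScaler(),
                    KernelPCA(n_components=10, kernel="rbf", gamma=0.05),
                    SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```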

19.
Let X be a random n-vector whose density function is given by a mixture of known multivariate normal density functions where the corresponding mixture proportions (a priori probabilities) are unknown. We present a numerically tractable method for obtaining estimates of the mixture proportions based on the linear feature selection technique of Guseman, Peters and Walker (1975).
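The paper's estimator is based on the Guseman–Peters–Walker linear feature selection technique; the sketch below instead uses the standard EM iteration for the proportions (with known component densities) purely to make the setup concrete. Components and settings are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
comps = [multivariate_normal([0, 0], np.eye(2)),
         multivariate_normal([4, 4], np.eye(2))]       # known component densities
true_pi = np.array([0.7, 0.3])
X = np.vstack([comps[k].rvs(int(1000 * true_pi[k]), random_state=k)
               for k in range(2)])

dens = np.column_stack([c.pdf(X) for c in comps])      # fixed n x K density matrix
pi = np.full(2, 0.5)
for _ in range(200):                                   # EM updates on pi only
    resp = dens * pi                                   # unnormalized responsibilities
    resp /= resp.sum(axis=1, keepdims=True)
    pi = resp.mean(axis=0)                             # M-step: average responsibility
print(pi.round(3))                                     # recovers roughly [0.7, 0.3]
```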

20.
Consider two parallel systems with their independent components' lifetimes following heterogeneous exponentiated generalized gamma distributions, where the heterogeneity is in both shape and scale parameters. We then obtain the usual stochastic (reversed hazard rate) order between the lifetimes of two systems by using the weak submajorization order between the vectors of shape parameters and the p-larger (weak supermajorization) order between the vectors of scale parameters, under some restrictions on the involved parameters. Further, by reducing the heterogeneity of parameters in each system, the usual stochastic (reversed hazard rate) order mentioned above is strengthened to the hazard rate (likelihood ratio) order. Finally, two characterization results concerning the comparisons of two parallel systems, one with independent heterogeneous generalized exponential components and another with independent homogeneous generalized exponential components, are derived. These characterization results enable us to find some lower and upper bounds for the hazard rate and reversed hazard rate functions of a parallel system consisting of independent heterogeneous generalized exponential components. The results established here generalize some of the known results in the literature concerning the comparisons of parallel systems under generalized exponential and exponentiated Weibull models.
