Similar Articles
 Found 20 similar articles (search time: 500 ms)
1.
ABSTRACT

Identifying homogeneous subsets of predictors in classification can be challenging in the presence of high-dimensional data with highly correlated variables. We propose a new method called cluster correlation-network support vector machine (CCNSVM) that simultaneously estimates clusters of predictors that are relevant for classification and the coefficients of a penalized SVM. The new CCN penalty is a function of the well-known Topological Overlap Matrix, whose entries measure the strength of connectivity between predictors. CCNSVM implements an efficient algorithm that alternates between searching for predictor clusters and optimizing a penalized SVM loss function using Majorization–Minimization techniques and a coordinate descent algorithm. Combining clustering and sparsity in a single procedure provides additional insight into the power of exploiting dimension-reduction structure in high-dimensional binary classification. Simulation studies are conducted to compare the performance of our procedure with that of its competitors. A practical application of CCNSVM to DNA methylation data illustrates its good behaviour.
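As a rough illustration of the connectivity measure underlying the CCN penalty, the following sketch computes a Topological Overlap Matrix from a WGCNA-style soft-thresholded correlation adjacency. The adjacency and penalty construction actually used inside CCNSVM may differ; the soft-thresholding power `beta` is an assumed, illustrative choice.

```python
import numpy as np

def topological_overlap(X, beta=6):
    """Topological Overlap Matrix from data X (n samples x p predictors).

    Uses a WGCNA-style soft-thresholded correlation adjacency; this is
    only illustrative of the connectivity measure, not CCNSVM itself.
    """
    corr = np.corrcoef(X, rowvar=False)   # p x p correlation matrix
    A = np.abs(corr) ** beta              # soft-thresholded adjacency
    np.fill_diagonal(A, 0.0)
    L = A @ A                             # shared-neighbour connectivity l_ij
    k = A.sum(axis=1)                     # node connectivities
    tom = (L + A) / (np.minimum.outer(k, k) + 1.0 - A)
    np.fill_diagonal(tom, 1.0)
    return tom

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
print(topological_overlap(X).shape)       # (30, 30)
```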

2.
Recent work has shown that Lasso-based regularization is very useful for estimating the high-dimensional inverse covariance matrix. A particularly useful scheme is based on penalizing the ℓ1 norm of the off-diagonal elements to encourage sparsity. We embed this type of regularization into high-dimensional classification. A two-stage estimation procedure is proposed which first recovers structural zeros of the inverse covariance matrix and then enforces block sparsity by moving non-zeros closer to the main diagonal. We show that the block-diagonal approximation of the inverse covariance matrix leads to an additive classifier, and demonstrate that accounting for the structure can yield better classification accuracy. The effect of the block size on classification is explored, and a class of asymptotically equivalent structure approximations in a high-dimensional setting is specified. We suggest variable selection at the block level and investigate properties of this procedure in growing-dimension asymptotics. We present a consistency result for the feature selection procedure, establish asymptotic lower and upper bounds for the fraction of separative blocks, and specify constraints under which reliable classification with block-wise feature selection can be performed. The relevance and benefits of the proposed approach are illustrated on both simulated and real data.
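A minimal sketch of the first stage, recovering structural zeros of the inverse covariance matrix with the ℓ1-penalized estimator in scikit-learn; the connected components of the resulting support graph suggest candidate diagonal blocks. The block-sparsity enforcement and the additive classifier of the paper are not reproduced here, and the penalty level `alpha` is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 40))            # toy data; the truth is diagonal

gl = GraphicalLasso(alpha=0.2).fit(X)     # stage 1: sparse precision matrix
support = np.abs(gl.precision_) > 1e-8    # estimated structural non-zeros

# Connected components of the support graph give candidate diagonal blocks
# for a block-diagonal approximation of the inverse covariance.
n_blocks, labels = connected_components(csr_matrix(support))
print(n_blocks, np.bincount(labels))
```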

3.
ABSTRACT

In recent years, applications of the Support Vector Machine (SVM) to classification and regression problems have been increasing, owing to its high performance and its ability to transform non-linear relationships among variables into linear form via the kernel trick (kernel function). In this work, we develop a semi-parametric approach that fits single-index models to deal with high-dimensional problems. To achieve this goal, we use support vector regression (SVR) to estimate the unknown nonparametric link function, while the single index is determined by the semi-parametric least squares method (Ichimura 1993). This development enhances the ability of SVR to solve high-dimensional problems. We design three simulation examples with high-dimensional problems (linear and nonlinear). The simulations demonstrate the superior performance of the proposed method over the standard SVR method. This is further illustrated with an application to real data.
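A toy alternating scheme in that spirit: for a fixed index direction the link is estimated by SVR, and the direction is then chosen to minimize in-sample squared error. This is only a hedged sketch; the paper determines the index by Ichimura's semi-parametric least squares, not by the generic optimizer used below.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVR

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.r_[1.0, -1.0, np.zeros(p - 2)]
y = np.sin(X @ beta_true) + 0.1 * rng.normal(size=n)

def fit_link(index, y):
    """Estimate the unknown link g by SVR on the one-dimensional index."""
    return SVR(kernel="rbf", C=10.0).fit(index.reshape(-1, 1), y)

def loss(beta):
    beta = beta / np.linalg.norm(beta)    # identifiability: ||beta|| = 1
    index = X @ beta
    g = fit_link(index, y)
    return np.mean((y - g.predict(index.reshape(-1, 1))) ** 2)

res = minimize(loss, x0=np.ones(p), method="Nelder-Mead",
               options={"maxiter": 300})
beta_hat = res.x / np.linalg.norm(res.x)
print(np.round(beta_hat, 2))
```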

4.
We consider the problem of variable selection in high-dimensional partially linear models with longitudinal data. A variable selection procedure is proposed based on the smooth-threshold generalized estimating equation (SGEE). The proposed procedure automatically eliminates inactive predictors by setting the corresponding parameters to zero, and simultaneously estimates the nonzero regression coefficients by solving the SGEE. We establish the asymptotic properties in a high-dimensional framework where the number of covariates p_n increases with the number of clusters n. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed variable selection procedure.
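The smooth-thresholding idea can be sketched as follows: from an initial estimate, compute δ_j = min(1, λ/|β̃_j|^{1+τ}) and set coefficients with δ_j = 1 exactly to zero. The sketch below substitutes ordinary least squares for the estimating-equation solve, so it is only schematic; λ and τ are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 20
X = rng.normal(size=(n, p))
beta_true = np.r_[2.0, -1.5, np.zeros(p - 2)]
y = X @ beta_true + rng.normal(size=n)

beta_init = np.linalg.lstsq(X, y, rcond=None)[0]    # initial estimator
lam, tau = 0.5, 1.0
delta = np.minimum(1.0, lam / np.abs(beta_init) ** (1 + tau))

active = delta < 1.0                                # predictors kept in model
beta_hat = np.zeros(p)
# Refit on the active set; the actual SGEE solves a penalized estimating
# equation at this step rather than plain least squares.
beta_hat[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
print(np.flatnonzero(active), np.round(beta_hat[active], 2))
```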

5.
For the multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative hypothesis requires complex analytic approximations and, more importantly, these distributional approximations are feasible only for moderate dimension of the dependent variable, say p≤20. On the other hand, assuming that the data dimension p as well as the number q of regression variables are fixed while the sample size n grows, several asymptotic approximations have been proposed in the literature for Wilks' Λ, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension p and a large sample size n. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null hypothesis, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large-p, large-n context, but also for moderately large data dimensions such as p=30 or p=50. As a byproduct, we explain why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple-sample significance test in multivariate analysis of variance which is valid for high-dimensional data.
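The failure of the chi-square approximation is easy to see by simulation. The sketch below computes Wilks' Λ for a one-way MANOVA under the null and estimates the actual size of the Bartlett-corrected chi-square test, which stays near the nominal 5% level for small p but inflates as p grows; the sample sizes and dimensions are illustrative choices, not those of the paper.

```python
import numpy as np
from scipy import stats

def wilks_lambda(X, groups):
    """One-way MANOVA Wilks' Lambda: det(W) / det(T)."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                              # total SSCP
    W = np.zeros_like(T)
    for g in np.unique(groups):                # within-group SSCP
        Xg = X[groups == g]
        Xg = Xg - Xg.mean(axis=0)
        W += Xg.T @ Xg
    return np.linalg.det(W) / np.linalg.det(T)

def null_rejection_rate(n_per_group=30, k=3, p=5, reps=2000, alpha=0.05):
    rng = np.random.default_rng(5)
    N = n_per_group * k
    groups = np.repeat(np.arange(k), n_per_group)
    crit = stats.chi2.ppf(1 - alpha, p * (k - 1))
    rej = 0
    for _ in range(reps):
        X = rng.normal(size=(N, p))            # null: no group differences
        stat = -(N - 1 - (p + k) / 2) * np.log(wilks_lambda(X, groups))
        rej += stat > crit                     # Bartlett chi-square test
    return rej / reps

print(null_rejection_rate(p=5))    # close to the nominal 0.05
print(null_rejection_rate(p=50))   # inflated: the approximation breaks down
```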

6.
ABSTRACT

In high-dimensional regression, the presence of influential observations may lead to inaccurate analysis results, so it is important to detect these unusual points before statistical regression analysis. Most traditional approaches, however, are based on single-case diagnostics and may fail in the presence of multiple influential observations because of masking effects. In this paper, an adaptive multiple-case deletion approach is proposed for detecting multiple influential observations in the presence of masking effects in high-dimensional regression. The procedure has two stages. In the first, we propose a multiple-case deletion technique and obtain an approximately clean subset of the data that is presumably free of influential observations. In the second, to enhance efficiency, we refine the detection rule. Monte Carlo simulation studies and a real-life data analysis demonstrate the effective performance of the proposed procedure.
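A schematic version of the two-stage idea, using repeated random half-sample fits so that each point is scored by residuals from models it did not influence, which limits masking. The actual deletion technique and refined detection rule of the paper are different; the lasso penalty and cutoff below are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 100, 200
X = rng.normal(size=(n, p))
beta = np.r_[3.0, -3.0, np.zeros(p - 2)]
y = X @ beta + 0.5 * rng.normal(size=n)
y[:5] += 10.0                                  # plant influential observations

# Stage 1 (schematic): fit on many random half-samples and score each point
# only with fits that excluded it.
scores, counts = np.zeros(n), np.zeros(n)
for _ in range(200):
    idx = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    out = np.setdiff1d(np.arange(n), idx)
    scores[out] += np.abs(y[out] - fit.predict(X[out]))
    counts[out] += 1
scores /= np.maximum(counts, 1)

# Stage 2 (schematic): refine by flagging scores above a robust cutoff.
med = np.median(scores)
mad = 1.4826 * np.median(np.abs(scores - med))
print(np.flatnonzero(scores > med + 3 * mad))  # should include 0..4
```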

7.
L1-type regularization provides a useful tool for variable selection in high-dimensional regression modeling, and various algorithms have been proposed to solve the resulting optimization problems. The coordinate descent algorithm, in particular, has been shown to be effective in sparse regression modeling. Although the algorithm performs remarkably well, it is sensitive to outliers, since the procedure is based on inner products of predictor variables and partial residuals computed in a non-robust manner. To overcome this drawback, we propose a robust coordinate descent algorithm, focusing in particular on high-dimensional regression modeling in the principal component space. We show that the proposed robust algorithm converges to the minimum value of its objective function. Monte Carlo experiments and real data analysis are conducted to examine the efficiency of the proposed robust algorithm. We observe that our robust coordinate descent algorithm performs effectively for high-dimensional regression modeling even in the presence of outliers.
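For reference, a plain (non-robust) coordinate descent for the lasso, in which each update soft-thresholds the inner product of a predictor with the partial residual; this inner product is exactly the quantity that a robust variant would replace with a robust counterpart. A minimal sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def cd_lasso(X, y, lam, n_iter=100):
    """Plain coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Each coefficient update soft-thresholds the inner product of predictor j
    with the partial residual; this step is non-robust to outliers.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * beta[j]             # partial residual
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / (col_sq[j] / n)
            r -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 50))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)
print(np.round(cd_lasso(X, y, lam=0.1)[:5], 2))
```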

8.
Bayesian model averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ℓ1 regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ℓ1 regularization path as a model space, we propose a method that resolves the model uncertainty arising from the selection of a single point on the solution path. We show that this method is computationally and empirically effective for regression and classification in high-dimensional data sets. We apply our technique in simulations, as well as to some applications that arise in genomics.
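A minimal sketch of the core idea: treat each distinct support along the ℓ1 path as a candidate model and average with BIC-based weights instead of committing to a single path point. The MC3 exploration of the paper is omitted, and BIC weights are a stand-in for its posterior model probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, lasso_path

rng = np.random.default_rng(8)
n, p = 100, 150
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Each support along the path is a candidate model, weighted by exp(-BIC/2).
supports, bics = [], []
for k in range(coefs.shape[1]):
    S = np.flatnonzero(coefs[:, k])
    if len(S) == 0 or len(S) >= n:
        continue
    fit = LinearRegression().fit(X[:, S], y)
    rss = np.sum((y - fit.predict(X[:, S])) ** 2)
    bics.append(n * np.log(rss / n) + (len(S) + 1) * np.log(n))
    supports.append(tuple(S))

w = np.exp(-(np.array(bics) - min(bics)) / 2)
w /= w.sum()                                   # approximate model weights
print(supports[int(np.argmax(w))], round(float(w.max()), 3))
```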

9.
Recently, a new ensemble classification method named Canonical Forest (CF) has been proposed by Chen et al. [Canonical forest. Comput Stat. 2014;29:849–867]. CF has been shown to give consistently good results on many data sets, comparable to other widely used ensemble classification methods. However, CF requires a feature reduction step before classifying high-dimensional data. Here, we extend CF to a high-dimensional classifier by incorporating a random feature subspace algorithm [Ho TK. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell. 1998;20:832–844]. The extended algorithm is called HDCF (high-dimensional CF), as it is specifically designed for high-dimensional data. We conducted an experiment using three data sets – gene imprinting, oestrogen, and leukaemia – to compare the performance of HDCF with several popular and successful classification methods for high-dimensional data sets, including Random Forest [Breiman L. Random forest. Mach Learn. 2001;45:5–32], CERP [Ahn H, et al. Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal. 2007;51:6166–6179], and support vector machines [Vapnik V. The nature of statistical learning theory. New York: Springer; 1995]. Besides classification accuracy, we also investigated the balance between sensitivity and specificity for all four classification methods.
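A sketch of the random subspace mechanism that HDCF borrows from Ho (1998): each base learner is trained on a random subset of the features and predictions are combined by majority vote. A standard decision tree stands in for the canonical-forest base learner here, and the subspace fraction is an illustrative choice.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, n_estimators=50, subspace=0.1, seed=None):
    """Random subspace ensemble: each learner sees a random feature subset."""
    rng = np.random.default_rng(seed)
    base = DecisionTreeClassifier()
    p = X.shape[1]
    k = max(1, int(subspace * p))
    models = []
    for _ in range(n_estimators):
        feats = rng.choice(p, size=k, replace=False)
        models.append((feats, clone(base).fit(X[:, feats], y)))
    return models

def predict_vote(models, X):
    votes = np.stack([m.predict(X[:, f]) for f, m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote, 0/1 labels

rng = np.random.default_rng(9)
X = rng.normal(size=(120, 500))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
models = random_subspace_ensemble(X, y, seed=9)
print((predict_vote(models, X) == y).mean())        # training accuracy
```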

10.
Classification of gene expression microarray data is important in the diagnosis of diseases such as cancer, but the analysis of microarray data often presents difficult challenges because the gene expression dimension is typically much larger than the sample size. Consequently, classification methods for microarray data often rely on regularization techniques to stabilize the classifier and improve classification performance. Numerous regularization techniques, such as covariance-matrix regularization, are available, which in practice makes the choice of a regularization method difficult. In this paper, we compare the classification performance of five covariance-matrix regularization methods applied to the linear discriminant function, using two simulated high-dimensional data sets and five well-known high-dimensional microarray data sets. In our simulation study, we found the minimum distance empirical Bayes method reported in Srivastava and Kubokawa [Comparison of discrimination methods for high dimensional data, J. Japan Statist. Soc. 37(1) (2007), pp. 123–134] and the new linear discriminant analysis reported in Thomaz, Kitani, and Gillies [A Maximum Uncertainty LDA-based approach for Limited Sample Size problems – with application to Face Recognition, J. Braz. Comput. Soc. 12(1) (2006), pp. 1–12] to perform consistently well, often outperforming the three other prominent regularization methods. Finally, we conclude with some recommendations for practitioners.
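To make the setting concrete, here is one simple covariance-matrix regularizer plugged into the linear discriminant function: a convex combination of the pooled covariance and a scaled identity. This is a generic ridge-type shrinkage for illustration, not one of the five specific methods compared in the paper; `gamma` is an assumed tuning value.

```python
import numpy as np

def shrinkage_lda(X_train, y_train, X_test, gamma=0.5):
    """LDA with Sigma_hat = (1 - gamma) * S_pooled + gamma * mean(diag(S)) * I,
    which keeps the pooled covariance invertible when p > n."""
    classes = np.unique(y_train)
    n, p = X_train.shape
    means, S = [], np.zeros((p, p))
    for c in classes:
        Xc = X_train[y_train == c]
        means.append(Xc.mean(axis=0))
        S += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    S /= (n - len(classes))
    Sigma = (1 - gamma) * S + gamma * np.mean(np.diag(S)) * np.eye(p)
    Sigma_inv = np.linalg.inv(Sigma)
    scores = np.column_stack([                  # linear discriminant scores
        X_test @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m for m in means
    ])
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(10)
X0 = rng.normal(0.0, 1.0, size=(30, 100))       # p = 100 >> n = 60
X1 = rng.normal(1.0, 1.0, size=(30, 100))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(30), np.ones(30)].astype(int)
print((shrinkage_lda(X, y, X) == y).mean())     # training accuracy
```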

11.
The paper considers a significance test of regression variables in the high-dimensional linear regression model when the dimension p of the regression variables tends to infinity together with the sample size n. Under two slightly different settings, we prove that the likelihood ratio test statistic converges in distribution to a Gaussian random variable, and we obtain explicit expressions for the asymptotic mean and variance. Simulations demonstrate that our high-dimensional likelihood ratio test outperforms traditional methods in analyzing high-dimensional data.

12.
Under non-normality, this article is concerned with testing the diagonality of a high-dimensional covariance matrix, which is more practical than testing sphericity or identity in the high-dimensional setting. The existing testing procedure for diagonality is not robust against either the data dimension or the data distribution, producing tests with type I error rates much larger than nominal levels. This is mainly due to the bias incurred when estimating functions of a high-dimensional covariance matrix under non-normality. Compared with the sphericity and identity hypotheses, the asymptotic analysis of the diagonality hypothesis is more involved, and the bias must be handled more carefully. We develop a correction that makes the existing test statistic robust against both the data dimension and the data distribution. We show that the proposed test statistic is asymptotically normal without the normality assumption and without specifying an explicit relationship between the dimension p and the sample size n. Simulations show that it has good size and power for a wide range of settings.
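As a simulation-based point of reference (not the paper's corrected statistic), one can calibrate a simple off-diagonal statistic by permuting each column independently. Note that this enforces full independence of the coordinates, a null stronger than diagonality, so the sketch is only a rough illustration of distribution-robust calibration.

```python
import numpy as np

def offdiag_stat(X):
    R = np.corrcoef(X, rowvar=False)
    return (R ** 2).sum() - R.shape[0]     # sum of squared off-diagonals

def permutation_pvalue(X, reps=500, seed=None):
    """Independently permute each column to mimic a diagonal-covariance null."""
    rng = np.random.default_rng(seed)
    obs = offdiag_stat(X)
    null = np.empty(reps)
    Xp = X.copy()
    for b in range(reps):
        for j in range(X.shape[1]):
            Xp[:, j] = rng.permutation(X[:, j])
        null[b] = offdiag_stat(Xp)
    return (1 + np.sum(null >= obs)) / (1 + reps)

rng = np.random.default_rng(12)
X = rng.standard_t(df=5, size=(100, 30))   # non-normal data, diagonal truth
print(permutation_pvalue(X, seed=12))      # p-value roughly uniform under null
```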

13.
ABSTRACT

A common Bayesian hierarchical model is one where high-dimensional observed data depend on high-dimensional latent variables that, in turn, depend on relatively few hyperparameters. When the full conditional distribution over latent variables has a known form, general MCMC sampling need only be performed on the low-dimensional marginal posterior distribution over hyperparameters. This improves on the popular Gibbs sampler, which computes over the full space. Sampling the marginal posterior over hyperparameters exhibits good scaling of computational cost with data size, particularly when that distribution depends on a low-dimensional sufficient statistic.
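A minimal illustration with a conjugate normal–normal hierarchy, where the latent variables integrate out analytically and a random-walk Metropolis sampler targets the one-dimensional marginal posterior of the hyperparameter directly, instead of Gibbs-sampling all the latent means. The model, flat prior, and step size are illustrative assumptions.

```python
import numpy as np

# y_i | theta_i ~ N(theta_i, 1), theta_i | mu ~ N(mu, tau^2), tau known.
# Marginally y_i ~ N(mu, 1 + tau^2), so the theta_i never need sampling.
rng = np.random.default_rng(13)
tau2 = 4.0
y = rng.normal(loc=2.0, scale=np.sqrt(1 + tau2), size=5000)

def log_marginal_post(mu):
    # Flat prior on mu; depends on the data only through (n, mean of y),
    # a low-dimensional sufficient statistic.
    return -0.5 * np.sum((y - mu) ** 2) / (1 + tau2)

mu, chain = 0.0, []
lp = log_marginal_post(mu)
for _ in range(5000):
    prop = mu + 0.05 * rng.normal()            # random-walk proposal
    lp_prop = log_marginal_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
        mu, lp = prop, lp_prop
    chain.append(mu)
print(np.mean(chain[1000:]))                   # approx 2.0
```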

14.
Abstract

Much attention has been paid to high-dimensional linear regression models, in which the number of observations is much smaller than the number of covariates. Since high dimensionality often induces collinearity, in this article we study penalized quantile regression with the elastic net (EnetQR), which combines the strengths of quadratic regularization and lasso shrinkage. We investigate the weak oracle property of the EnetQR under mild conditions in the high-dimensional setting. Moreover, we propose a two-step procedure, called adaptive elastic net quantile regression (AEnetQR), in which the weight vector in the second step is constructed from the EnetQR estimate of the first step. This two-step procedure is shown theoretically to possess the weak oracle property. Finite-sample properties are examined through Monte Carlo simulation and a real-data analysis.
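A hedged sketch of the two-step idea using scikit-learn's QuantileRegressor, which implements only the ℓ1 penalty; the ridge part of the elastic net is therefore omitted. The adaptive weights are imposed by rescaling columns, a standard trick that turns the uniform ℓ1 penalty into a weighted one; the penalty level is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(14)
n, p = 150, 50
X = rng.normal(size=(n, p))
beta = np.r_[2.0, -2.0, np.zeros(p - 2)]
y = X @ beta + rng.standard_t(df=3, size=n)    # heavy-tailed errors

# Step 1: pilot L1-penalized median regression.
pilot = QuantileRegressor(quantile=0.5, alpha=0.05, solver="highs").fit(X, y)

# Step 2: adaptive weights w_j = 1 / (|beta_j| + eps).  Scaling column j by
# 1 / w_j makes the uniform L1 penalty act as a weighted (adaptive) penalty.
w = 1.0 / (np.abs(pilot.coef_) + 1e-4)
step2 = QuantileRegressor(quantile=0.5, alpha=0.05,
                          solver="highs").fit(X / w, y)
coef = step2.coef_ / w                         # back to the original scale
print(np.flatnonzero(np.abs(coef) > 1e-6))     # selected predictors
```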

15.
Abstract

A nonparametric procedure is proposed to estimate multiple location change-points in a univariate data sequence by using ranks instead of the raw data. While existing rank-based multiple change-point detection methods are mostly based on sequential tests, we treat detection as a model selection problem. We derive the corresponding Schwarz information criterion for rank statistics, prove the consistency of the change-point estimator, and compute the estimator with a pruned dynamic programming algorithm. Simulation studies show our method's robustness, effectiveness and efficiency in detecting mean changes. We also apply the method to a gene dataset as an illustration.
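A sketch of the model-selection view: an exact O(n²) optimal-partitioning recursion on the ranks with a sum-of-squares segment cost and an SIC-style penalty. The paper's rank-based criterion and the pruning that accelerates this recursion are not reproduced; the penalty used in the example is an illustrative assumption.

```python
import numpy as np

def rank_changepoints(x, penalty):
    """Optimal partitioning by dynamic programming on the ranks of x."""
    r = np.argsort(np.argsort(x)).astype(float)     # ranks of the data
    n = len(r)
    cs, cs2 = np.r_[0, np.cumsum(r)], np.r_[0, np.cumsum(r ** 2)]

    def cost(i, j):                                 # rank SSE of segment [i, j)
        s, s2, m = cs[j] - cs[i], cs2[j] - cs2[i], j - i
        return s2 - s * s / m

    F = np.full(n + 1, np.inf)
    F[0] = -penalty
    last = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        cands = [F[i] + penalty + cost(i, j) for i in range(j)]
        last[j] = int(np.argmin(cands))
        F[j] = cands[last[j]]
    cps, j = [], n                                  # backtrack change-points
    while j > 0:
        if last[j] > 0:
            cps.append(last[j])
        j = last[j]
    return sorted(cps)

rng = np.random.default_rng(15)
x = np.r_[rng.normal(0, 1, 100), rng.normal(3, 1, 100)]
pen = 2 * np.var(np.arange(len(x))) * np.log(len(x))  # SIC-style, illustrative
print(rank_changepoints(x, pen))                      # roughly [100]
```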

16.
There are few distribution-free methods for detecting interaction in fixed-dose trials involving quantal response data, despite the fact that such trials are common. We present three new tests to address this issue, including a simple bootstrap procedure. We examine the power of the likelihood ratio test and our new bootstrap test statistic using the innovative linear-extrapolation power-estimation technique described by Boos and Zhang [Monte Carlo evaluation of resampling-based hypothesis tests. J. Amer. Statist. Assoc. 2000;95:486–492].
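A minimal parametric-bootstrap LRT for interaction in a 2×2 fixed-dose quantal-response design, fitting additive and interaction logistic models and resampling under the fitted additive model. This illustrates the flavour of a bootstrap interaction test but is not one of the paper's three proposed tests; the design and dose effects are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(16)

# 2 x 2 fixed-dose design with binomial (quantal) responses, m per cell.
a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
m = 50
p_true = 1 / (1 + np.exp(-(-1 + 0.8 * a + 0.8 * b)))   # additive truth
s = rng.binomial(m, p_true)
Y = np.column_stack([s, m - s])                        # successes, failures

def lrt_stat(Y):
    X0 = sm.add_constant(np.column_stack([a, b]))           # additive
    X1 = sm.add_constant(np.column_stack([a, b, a * b]))    # + interaction
    f0 = sm.GLM(Y, X0, family=sm.families.Binomial()).fit()
    f1 = sm.GLM(Y, X1, family=sm.families.Binomial()).fit()
    return 2 * (f1.llf - f0.llf)

obs = lrt_stat(Y)

# Parametric bootstrap under the fitted no-interaction model.
X0 = sm.add_constant(np.column_stack([a, b]))
p0 = sm.GLM(Y, X0, family=sm.families.Binomial()).fit().predict(X0)

def boot_once():
    s = rng.binomial(m, p0)
    return lrt_stat(np.column_stack([s, m - s]))

boot = np.array([boot_once() for _ in range(500)])
print((1 + np.sum(boot >= obs)) / 501)                 # bootstrap p-value
```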

17.
ABSTRACT

We propose an extension of parametric product partition models (PPMs). We name our proposal nonparametric product partition models because we associate a random measure, instead of a parametric kernel, with each set within a random partition. Our methodology does not impose any specific form on the marginal distribution of the observations, allowing us to detect shifts of behaviour even when dealing with heavy-tailed or skewed distributions. We propose a suitable loss function and find the partition of the data having minimum expected loss. We then apply our nonparametric procedure to multiple change-point analysis and compare it with parametric PPMs and with other methodologies that have recently appeared in the literature. In the context of missing data, we also exploit the product partition structure to estimate the distribution function of each missing value, allowing us to detect change points using the loss function mentioned above. Finally, we present applications to financial as well as genetic data.

18.
ABSTRACT

This article presents a Bayesian analysis of the von Mises–Fisher distribution, which is the most important distribution in the analysis of directional data. We obtain samples from the posterior distribution using a sampling-importance-resampling method. The procedure is illustrated using simulated data as well as real data sets previously analyzed in the literature.
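A minimal sampling-importance-resampling sketch for the circular von Mises case (the simplest member of the von Mises–Fisher family): draw parameters from a prior, weight by the likelihood, and resample. The priors and sample sizes are illustrative assumptions, not those of the article.

```python
import numpy as np
from scipy.stats import vonmises

rng = np.random.default_rng(18)
theta = vonmises.rvs(kappa=3.0, loc=1.0, size=50, random_state=18)  # data

# SIR: prior draws for (mu, kappa), importance weights = likelihood, resample.
n_prior = 10000
mu = rng.uniform(-np.pi, np.pi, n_prior)       # illustrative priors
kappa = rng.gamma(2.0, 2.0, n_prior)

log_w = vonmises.logpdf(theta[None, :], kappa[:, None],
                        loc=mu[:, None]).sum(axis=1)
w = np.exp(log_w - log_w.max())
w /= w.sum()
idx = rng.choice(n_prior, size=2000, replace=True, p=w)
print(np.mean(mu[idx]), np.mean(kappa[idx]))   # posterior means near (1, 3)
```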

19.
ABSTRACT

Incremental modelling of data streams is of great practical importance, as shown by its applications in advertising and financial data analysis. We propose two incremental covariance matrix decomposition methods for compositional data. The first method, exact incremental covariance decomposition of compositional data (C-EICD), gives an exact decomposition result. The second method, covariance-free incremental covariance decomposition of compositional data (C-CICD), is an approximate algorithm that can efficiently handle high-dimensional cases. Based on these two methods, many frequently used compositional statistical models can be calculated incrementally. We take multiple linear regression and principal component analysis as examples to illustrate the utility of the proposed methods via extensive simulation studies.
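To illustrate the incremental flavour, the sketch below combines the centred log-ratio transform for compositional data with a streaming rank-one covariance update; the exact and covariance-free decompositions (C-EICD, C-CICD) themselves are not reproduced.

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform, mapping a composition to real space."""
    lx = np.log(x)
    return lx - lx.mean(axis=-1, keepdims=True)

class IncrementalCovariance:
    """Streaming mean/covariance via rank-one (Welford-style) updates.
    Illustrates the incremental idea only, not the paper's decompositions."""
    def __init__(self, p):
        self.n, self.mean, self.M2 = 0, np.zeros(p), np.zeros((p, p))
    def update(self, z):
        self.n += 1
        d = z - self.mean
        self.mean += d / self.n
        self.M2 += np.outer(d, z - self.mean)
    @property
    def cov(self):
        return self.M2 / (self.n - 1)

rng = np.random.default_rng(19)
inc = IncrementalCovariance(p=4)
for x in rng.dirichlet(np.ones(4), size=1000):   # stream of compositions
    inc.update(clr(x))
# Eigendecomposition of the running covariance, as in incremental PCA;
# the clr covariance is singular, so the smallest eigenvalue is near zero.
print(np.round(np.linalg.eigvalsh(inc.cov), 3))
```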

20.
ABSTRACT

In this article, we study estimation for a class of semiparametric mixtures of generalized linear models in which the mixing proportions depend on a covariate nonparametrically. We investigate a backfitting estimation procedure and show the asymptotic normality of the proposed estimators under mild conditions. We conduct simulations to show the good performance of our methodology and present a real-data analysis as an illustration.
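A toy version of the model for two Gaussian linear components: an EM-type loop in which the component regressions are updated by weighted least squares and the covariate-dependent mixing proportions are estimated by kernel smoothing of the responsibilities. The paper's backfitting procedure for mixtures of GLMs is more general; the bandwidth and initial values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(20)
n = 400
z = rng.uniform(0, 1, n)                       # covariate driving the mixing
pi_true = 1 / (1 + np.exp(-6 * (z - 0.5)))     # nonparametric proportions
comp = rng.uniform(size=n) < pi_true
x = rng.normal(size=n)
y = np.where(comp, 2 + 2 * x, -2 - 2 * x) + 0.5 * rng.normal(size=n)

def nw_smooth(z, r, h=0.1):
    """Nadaraya-Watson smoothing of responsibilities over the covariate z."""
    K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    return K @ r / K.sum(axis=1)

X = np.column_stack([np.ones(n), x])
b = np.array([[1.0, 1.0], [-1.0, -1.0]])       # initial component coefficients
s = np.ones(2)
pi_z = np.full(n, 0.5)
for _ in range(50):                            # EM-type iterations
    dens = np.stack([np.exp(-0.5 * ((y - X @ b[k]) / s[k]) ** 2) / s[k]
                     for k in range(2)])
    r = pi_z * dens[0] / (pi_z * dens[0] + (1 - pi_z) * dens[1])   # E-step
    for k, w in enumerate([r, 1 - r]):         # M-step: weighted least squares
        b[k] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        s[k] = np.sqrt(np.sum(w * (y - X @ b[k]) ** 2) / w.sum())
    pi_z = nw_smooth(z, r)                     # smooth proportions over z
print(np.round(b, 2))                          # near [[2, 2], [-2, -2]]
```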
