20 similar articles found (search time: 15 ms)
1.
Classification of gene expression microarray data is important in the diagnosis of diseases such as cancer, but often the analysis of microarray data presents difficult challenges because the gene expression dimension is typically much larger than the sample size. Consequently, classification methods for microarray data often rely on regularization techniques to stabilize the classifier for improved classification performance. In particular, numerous regularization techniques, such as covariance-matrix regularization, are available, which, in practice, lead to a difficult choice of regularization methods. In this paper, we compare the classification performance of five covariance-matrix regularization methods applied to the linear discriminant function using two simulated high-dimensional data sets and five well-known, high-dimensional microarray data sets. In our simulation study, we found the minimum distance empirical Bayes method reported in Srivastava and Kubokawa [Comparison of discrimination methods for high dimensional data, J. Japan Statist. Soc. 37(1) (2007), pp. 123–134], and the new linear discriminant analysis reported in Thomaz, Kitani, and Gillies [A Maximum Uncertainty LDA-based approach for Limited Sample Size problems – with application to Face Recognition, J. Braz. Comput. Soc. 12(1) (2006), pp. 1–12], to perform consistently well and often outperform three other prominent regularization methods. Finally, we conclude with some recommendations for practitioners.
2.
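In canonical correlation analysis, obtaining the expressions for the canonical variates does not complete the task; for example, the number of canonical variates still has to be determined and variables still have to be selected. For the problem of the number of canonical variates, shortcomings of the commonly used chi-square test and redundancy analysis are identified, and a new algorithm is proposed. For the selection of the original variables, three possible approaches are proposed. Finally, the conclusions are verified using human body dimension data.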
3.
Mr Norm A Campbell 《Journal of applied statistics》1979,6(1):7-18
Canonical variate analysis can be viewed as a two-stage principal component analysis. Explicit consideration of the principal components from the first stage, formalized in the context of shrunken estimators, leads to a number of practical advantages. In morphometric studies, the first eigenvector is often a size vector, with the remaining vectors contrast or shape-type vectors, so that a decomposition of the canonical variates into size and shape components can be achieved. In applied studies, often a small number of the principal components effect most of the separation between groups; plots of group means and associated concentration ellipses (ideally these should be circular) for important principal components facilitate graphical inspection. Of considerable practical importance is the potential for improved stability of the estimated canonical vectors. When the between-groups sum of squares for a particular principal component is small, and the corresponding eigenvalue of the within-groups correlation matrix is also small, marked instability of the canonical vectors can be expected. The introduction of shrunken estimators, by adding shrinkage constants to the eigenvalues, leads to more stable coefficients.
4.
Michael J. Brusco Clay M. Voorhees Roger J. Calantone Michael K. Brady Douglas Steinley 《Communications in Statistics - Simulation and Computation》2019,48(6):1623-1636
We propose a hybrid two-group classification method that integrates linear discriminant analysis, a polynomial expansion of the basis (or variable space), and a genetic algorithm with multiple crossover operations to select variables from the expanded basis. Using new product launch data from the biochemical industry, we found that the proposed algorithm offers mean percentage decreases in the misclassification error rate of 50%, 56%, 59%, 77%, and 78% in comparison to a support vector machine, artificial neural network, quadratic discriminant analysis, linear discriminant analysis, and logistic regression, respectively. These improvements correspond to annual cost savings of $4.40–$25.73 million.
5.
This paper discusses a supervised classification approach for the differential diagnosis of Raynaud's phenomenon (RP). The classification of data from healthy subjects and from patients suffering from primary and secondary RP is obtained by means of a set of classifiers derived within the framework of linear discriminant analysis. A set of functional variables and shape measures extracted from rewarming/reperfusion curves are proposed as discriminant features. Since the prediction of group membership is based on a large number of these features, the high dimension/small sample size problem is considered to overcome the singularity problem of the within-group covariance matrix. Results on a data set of 72 subjects demonstrate that a satisfactory classification of the subjects can be achieved through the proposed methodology.
6.
We study the design problem for the optimal classification of functional data. The goal is to select sampling time points so that functional data observed at these time points can be classified accurately. We propose optimal designs that are applicable to either dense or sparse functional data. Using linear discriminant analysis, we formulate our design objectives as explicit functions of the sampling points. We study the theoretical properties of the proposed design objectives and provide a practical implementation. The performance of the proposed design is evaluated through simulations and real data applications. The Canadian Journal of Statistics 48: 285–307; 2020 © 2019 Statistical Society of Canada
7.
Asymptotic Optimality of Sparse Linear Discriminant Analysis with Arbitrary Number of Classes
Many sparse linear discriminant analysis (LDA) methods have been proposed to overcome the major problems of the classic LDA in high‐dimensional settings. However, the asymptotic optimality results are limited to the case with only two classes. When there are more than two classes, the classification boundary is complicated and no explicit formulas for the classification errors exist. We consider the asymptotic optimality in the high‐dimensional settings for a large family of linear classification rules with arbitrary number of classes. Our main theorem provides easy‐to‐check criteria for the asymptotic optimality of a general classification rule in this family as dimensionality and sample size both go to infinity and the number of classes is arbitrary. We establish the corresponding convergence rates. The general theory is applied to the classic LDA and the extensions of two recently proposed sparse LDA methods to obtain the asymptotic optimality.
8.
Anant M. Kshirsagar Thomas M. Davis 《Australian & New Zealand Journal of Statistics》1983,25(3):467-481
Khatri (1966) has derived a Wilks' Λ test of a general linear hypothesis in the growth curve model. In this paper we give the direction and collinearity factors and their null distributions when the hypothesis is not true but the noncentrality matrix is of rank one. Interpretation of these tests and their usefulness in discrimination in growth curve models are discussed.
9.
Antonio Carlos Gonçalves Renan M.V.R. Almeida Marcos Pereira Estellita Lins 《Journal of applied statistics》2013,40(5):1032-1043
This work investigates the use of canonical correlation analysis (CCA) in the definition of weight restrictions for data envelopment analysis (DEA). With this purpose, CCA limits are introduced into Wong and Beasley's DEA model. An application of the method is made over data from hospitals in 27 Brazilian cities, producing as outputs average payment (average admission values) and percentage of hospital admissions according to disease groups (International Classification of Diseases, 9th Edition), and having as inputs mortality rates and average stay (length of stay after admission (days)). In this application, performance scores were calculated for both the (CCA) restricted and unrestricted DEA models. It can be concluded that the use of CCA-based weight limits for DEA models increases the consistency of the estimated DEA scores (more homogenous weights) and that these limits do not present mathematical infeasibility problems while avoiding the need for subjectively restricting weight variation in DEA.
10.
Helle Sørensen Anders Tolver Maj Halling Thomsen Pia Haubro Andersen 《Journal of applied statistics》2012,39(2):337-360
This paper presents a study on symmetry of repeated bi-phased data signals, in particular, on quantification of the deviation between the two parts of the signal. Three symmetry scores are defined using functional data techniques such as smoothing and registration. One score is related to the L2-distance between the two parts of the signal, whereas the other two are constructed to specifically measure differences in amplitude and phase. Moreover, symmetry scores based on functional principal component analysis (PCA) are examined. The scores are applied to acceleration signals from a study on equine gait. The scores turn out to be highly associated with lameness, and their applicability for lameness quantification and detection is investigated. Four classification approaches turn out to give similar results. The scores describing amplitude and phase variation turn out to outperform the PCA scores when it comes to the classification of lameness.
11.
Canonical correlation analysis (CCA) is often used to analyze the correlation between two random vectors. However, interpreting CCA results can sometimes be hard. In an attempt to address these difficulties, principal canonical correlation analysis (PCCA) was proposed. PCCA is CCA between two sets of principal component (PC) scores. We consider the problem of selecting useful PC scores in CCA. A variable selection criterion for one set of PC scores has been proposed by Ogura (2010); here, we propose a variable selection criterion for two sets of PC scores in PCCA. Furthermore, we demonstrate the effectiveness of this criterion.
12.
Jeffrey M. Albert Anant M. Kshirsagar 《Australian & New Zealand Journal of Statistics》1993,35(3):345-357
This paper presents a method of discriminant analysis especially suited to longitudinal data. The approach is in the spirit of canonical variate analysis (CVA) and is similarly intended to reduce the dimensionality of multivariate data while retaining information about group differences. A drawback of CVA is that it does not take advantage of special structures that may be anticipated in certain types of data. For longitudinal data, it is often appropriate to specify a growth curve structure (as given, for example, in the model of Potthoff & Roy, 1964). The present paper focuses on this growth curve structure, utilizing it in a model-based approach to discriminant analysis. For this purpose the paper presents an extension of the reduced-rank regression model, referred to as the reduced-rank growth curve (RRGC) model. It estimates discriminant functions via maximum likelihood and gives a procedure for determining dimensionality. This methodology is exploratory only, and is illustrated by a well-known dataset from Grizzle & Allen (1969).
13.
We compare the performance of recently developed regularized covariance matrix estimators for Markowitz's portfolio optimization and of the minimum variance portfolio (MVP) problem in particular. We focus on seven estimators that are applied to the MVP problem in the literature; three regularize the eigenvalues of the sample covariance matrix, and the other four assume the sparsity of the true covariance matrix or its inverse. Comparisons are made with two sets of long-term S&P 500 stock return data that represent two extreme scenarios of active and passive management. The results show that the MVPs with sparse covariance estimators have high Sharpe ratios but that the naive diversification (also known as the ‘uniform (on market share) portfolio’) still performs well in terms of wealth growth.
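As a rough illustration of the MVP problem mentioned in this abstract, the sketch below computes the unconstrained global minimum variance weights w = Σ⁻¹1 / (1ᵀΣ⁻¹1) from a covariance matrix. The linear shrinkage toward the diagonal is a generic stand-in and is not one of the seven estimators the paper compares; the function names and the parameter `alpha` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def shrink_to_diagonal(cov, alpha):
    """Assumed regularizer: (1 - alpha) * cov + alpha * diag(cov), with 0 <= alpha <= 1."""
    n = len(cov)
    return [[cov[i][j] if i == j else (1 - alpha) * cov[i][j]
             for j in range(n)] for i in range(n)]


def min_variance_weights(cov):
    """Global minimum variance weights; short positions are allowed (no constraints)."""
    z = solve(cov, [1.0] * len(cov))  # z = cov^{-1} 1
    total = sum(z)
    return [v / total for v in z]


sample_cov = [[0.04, 0.01], [0.01, 0.01]]
weights = min_variance_weights(shrink_to_diagonal(sample_cov, alpha=0.5))
```

The weights always sum to one by construction; with `alpha=1` the covariance becomes diagonal and each weight is proportional to the inverse variance of its asset.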
14.
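A comparison of random forests and Bagging classification trees for classification (cited 5 times in total; self-citations: 0; citations by others: 5)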
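Using experimental data, this paper explains from two theoretical perspectives why the random forest algorithm outperforms the Bagging classification tree algorithm. Expressing the two algorithms within two different frameworks removes some ambiguities in their analysis. Under the second framework in particular, it becomes clear that the random forest algorithm outperforms the Bagging classification tree algorithm because it corresponds to a smaller bias.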
15.
Keith E. Muller 《The American statistician》2013,67(4):342-354
Canonical correlation has been little used and little understood, even by otherwise sophisticated analysts. An alternative approach to canonical correlation, based on a general linear multivariate model, is presented. Properties of principal component analysis are used to help explain the method. Standard computational methods for full rank canonical correlation, techniques for canonical correlation on component scores, and canonical correlation with less than full rank are discussed. They are seen to be essentially equivalent when the model equation for canonical correlation on component scores is presented. The two approaches to less than full rank situations are equivalent in some senses, but quite different in usefulness, depending on the application. An example dataset is analyzed in detail to help demonstrate the conclusions.
16.
Philippe Casin 《Journal of applied statistics》2018,45(8):1396-1409
Credit scoring techniques have been developed in recent years to reduce the risk taken by banks and financial institutions in the loans they grant. Credit scoring is a problem of classifying individuals into one of two groups: defaulting borrowers or non-defaulting borrowers. The aim of this paper is to propose a new method of discrimination when the dependent variable is categorical and when a large number of categorical explanatory variables are retained. This method, Categorical Multiblock Linear Discriminant Analysis, computes components which take into account both relationships between explanatory categorical variables and canonical correlation between each explanatory categorical variable and the dependent variable. A comparison with three other techniques and an application on credit scoring data are provided.
17.
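A comparative canonical correlation analysis of the tertiary-industry economy and the tourism economy in Guangdong Province (cited 1 time in total; self-citations: 0; citations by others: 1)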
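Since the reform and opening-up, Guangdong's economy has developed at an unprecedented pace, and its tertiary-industry and tourism economies have grown rapidly. The vigorous growth of the tertiary industry has provided a solid material foundation for the tourism economy; in turn, localities across Guangdong have combined tourism development with their local economies, spurring local economic take-off and advancing the province's economy. A comparative canonical correlation analysis of the tertiary-industry economy and the tourism economy across Guangdong's regions in 1998 and 2003 reveals how the canonical correlation between the two has changed, providing a basis for formulating appropriate policies.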
18.
In this paper, we apply empirical likelihood (EL) to two-sample problems with growing high dimensionality. Our results are demonstrated by constructing confidence regions for the difference of the means of two p-dimensional samples and for the difference between the coefficients of two p-dimensional sample linear models. We show that the EL-based estimator is efficient. That is, as p → ∞ for high-dimensional data, the limit distribution of the EL ratio statistic for the difference of the means of two samples and for the difference between the coefficients of two-sample linear models is asymptotically normal. Furthermore, EL gives an efficient estimator for regression coefficients in linear models and can be as efficient as a parametric approach. The performance of the proposed method is illustrated via numerical simulations.
19.
Britta Anker Bak Jens Ledet Jensen Morten Fenger‐Grøn 《Scandinavian Journal of Statistics》2015,42(1):32-42
We consider classification in the situation of two groups with normally distributed data in the ‘large p small n’ framework. To counterbalance the high number of variables, we consider the thresholded independence rule. An upper bound on the classification error is established that is tailored to a mean value of interest in biological applications.
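To make the idea of a thresholded independence rule concrete, here is a minimal sketch of one common reading of the term; it is not the authors' exact procedure. It assumes a diagonal (independence) covariance, a pooled per-feature variance, and a hard threshold on the absolute standardized mean difference. The function names and the threshold `t` are hypothetical.

```python
from statistics import mean


def fit_thresholded_independence_rule(x0, x1, t):
    """Fit a diagonal-covariance linear rule on two groups.

    x0, x1: lists of samples (each a list of p floats) for groups 0 and 1.
    Only features whose absolute standardized mean difference exceeds t are kept.
    Returns (kept_indices, midpoints, weights).
    """
    p = len(x0[0])
    n0, n1 = len(x0), len(x1)
    kept, mids, weights = [], [], []
    for j in range(p):
        c0 = [row[j] for row in x0]
        c1 = [row[j] for row in x1]
        m0, m1 = mean(c0), mean(c1)
        # Pooled within-group variance of feature j (assumed equal in both groups).
        ss = sum((v - m0) ** 2 for v in c0) + sum((v - m1) ** 2 for v in c1)
        var = ss / (n0 + n1 - 2)
        if var == 0:
            continue  # a constant feature carries no usable scale
        if abs(m1 - m0) / var ** 0.5 > t:
            kept.append(j)
            mids.append((m0 + m1) / 2)
            weights.append((m1 - m0) / var)
    return kept, mids, weights


def predict(x, model):
    """Assign x to group 1 if the linear score is positive, otherwise to group 0."""
    kept, mids, weights = model
    score = sum(w * (x[j] - m) for j, m, w in zip(kept, mids, weights))
    return 1 if score > 0 else 0


group0 = [[0.0, 1.0], [0.2, -1.0], [-0.2, 0.5]]
group1 = [[3.0, 0.8], [3.2, -0.9], [2.8, 0.4]]
model = fit_thresholded_independence_rule(group0, group1, t=2.0)
```

In this toy data only the first feature separates the groups, so the second is dropped by the threshold and has no influence on predictions.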
20.
Kai Xu 《Journal of Statistical Computation and Simulation》2017,87(16):3208-3224
Under non-normality, this article is concerned with testing the diagonality of a high-dimensional covariance matrix, which is more practical than testing sphericity and identity in the high-dimensional setting. The existing testing procedure for diagonality is not robust against either the data dimension or the data distribution, producing tests with distorted type I error rates much larger than nominal levels. This is mainly due to bias from estimating some functions of the high-dimensional covariance matrix under non-normality. Compared to the sphericity and identity hypotheses, the asymptotic properties under the diagonality hypothesis are more involved, and the bias must be handled with more care. We develop a correction that makes the existing test statistic robust against both the data dimension and the data distribution. We show that the proposed test statistic is asymptotically normal without the normality assumption and without specifying an explicit relationship between the dimension p and the sample size n. Simulations show that it has good size and power for a wide range of settings.