
1.

A comparison of regularization methods applied to the linear discriminant function with high-dimensional microarray data

John A. Ramey Phil D. Young 《Journal of Statistical Computation and Simulation》2013,83(3):581-596

Classification of gene expression microarray data is important in the diagnosis of diseases such as cancer, but often the analysis of microarray data presents difficult challenges because the gene expression dimension is typically much larger than the sample size. Consequently, classification methods for microarray data often rely on regularization techniques to stabilize the classifier for improved classification performance. In particular, numerous regularization techniques, such as covariance-matrix regularization, are available, which, in practice, lead to a difficult choice of regularization methods. In this paper, we compare the classification performance of five covariance-matrix regularization methods applied to the linear discriminant function using two simulated high-dimensional data sets and five well-known, high-dimensional microarray data sets. In our simulation study, we found the minimum distance empirical Bayes method reported in Srivastava and Kubokawa [Comparison of discrimination methods for high dimensional data, J. Japan Statist. Soc. 37(1) (2007), pp. 123–134], and the new linear discriminant analysis reported in Thomaz, Kitani, and Gillies [A Maximum Uncertainty LDA-based approach for Limited Sample Size problems – with application to Face Recognition, J. Braz. Comput. Soc. 12(1) (2006), pp. 1–12], to perform consistently well and often outperform three other prominent regularization methods. Finally, we conclude with some recommendations for practitioners.
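To make the idea of covariance-matrix regularization in the abstract above concrete, here is a minimal sketch of a two-class linear discriminant function whose pooled covariance estimate is stabilized by adding a ridge term. It is not one of the five methods compared in the paper: the ridge form S + gamma * I, the equal-prior decision rule, and the names `regularized_lda_fit`, `regularized_lda_predict`, and `gamma` are all assumptions chosen for illustration.

```python
import numpy as np


def regularized_lda_fit(X, y, gamma=1.0):
    """Fit a two-class linear discriminant with a ridge-regularized covariance.

    Illustrative only: the estimate S + gamma * I is an assumed, generic
    regularizer, not a method taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance; it is singular whenever p > n - 2.
    centered = np.vstack([X0 - m0, X1 - m1])
    S = centered.T @ centered / (X.shape[0] - 2)
    # Adding gamma to every eigenvalue makes the estimate invertible.
    S_reg = S + gamma * np.eye(X.shape[1])
    w = np.linalg.solve(S_reg, m1 - m0)
    # Equal class priors are assumed, so the threshold sits at the midpoint.
    b = -0.5 * w @ (m0 + m1)
    return classes, w, b


def regularized_lda_predict(model, X):
    """Assign each row of X to the class on its side of the hyperplane."""
    classes, w, b = model
    scores = np.asarray(X, dtype=float) @ w + b
    return np.where(scores > 0, classes[1], classes[0])


# Tiny example in which the pooled covariance is singular (rank 1).
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = regularized_lda_fit(X_demo, y_demo, gamma=1.0)
demo_pred = regularized_lda_predict(demo_model, X_demo)
```

Without the ridge term, `np.linalg.solve` would fail on this example because the pooled covariance has a zero eigenvalue, which is the same obstacle that arises when the number of genes exceeds the number of samples.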

2.

A Study Extending Canonical Correlation Analysis

杜子芳 常志勇 《统计与信息论坛》2014,(5):3-7

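In canonical correlation analysis, obtaining the expressions for the canonical variates does not complete the task; for example, the number of canonical variates must still be determined and variables must be selected. For the number of canonical variates, the paper identifies shortcomings of the commonly used chi-square test and of redundancy analysis, and proposes a new algorithm. For the selection of the original variables, it proposes three possible approaches. Finally, the conclusions are verified using human body measurement data.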

3.

Some Practical Aspects of Canonical Variate Analysis

Mr Norm A Campbell 《Journal of applied statistics》1979,6(1):7-18

Canonical variate analysis can be viewed as a two-stage principal component analysis. Explicit consideration of the principal components from the first stage, formalized in the context of shrunken estimators, leads to a number of practical advantages. In morphometric studies, the first eigenvector is often a size vector, with the remaining vectors contrast or shape-type vectors, so that a decomposition of the canonical variates into size and shape components can be achieved. In applied studies, often a small number of the principal components effect most of the separation between groups; plots of group means and associated concentration ellipses (ideally these should be circular) for important principal components facilitate graphical inspection. Of considerable practical importance is the potential for improved stability of the estimated canonical vectors. When the between-groups sum of squares for a particular principal component is small, and the corresponding eigenvalue of the within-groups correlation matrix is also small, marked instability of the canonical vectors can be expected. The introduction of shrunken estimators, by adding shrinkage constants to the eigenvalues, leads to more stable coefficients.

4.

Michael J. Brusco Clay M. Voorhees Roger J. Calantone Michael K. Brady Douglas Steinley 《Communications in Statistics - Simulation and Computation》2019,48(6):1623-1636

We propose a hybrid two-group classification method that integrates linear discriminant analysis, a polynomial expansion of the basis (or variable space), and a genetic algorithm with multiple crossover operations to select variables from the expanded basis. Using new product launch data from the biochemical industry, we found that the proposed algorithm offers mean percentage decreases in the misclassification error rate of 50%, 56%, 59%, 77%, and 78% in comparison to a support vector machine, artificial neural network, quadratic discriminant analysis, linear discriminant analysis, and logistic regression, respectively. These improvements correspond to annual cost savings of $4.40–$25.73 million.

5.

Ci‐Ren Jiang Lu‐Hung Chen 《Wiley Interdisciplinary Reviews: Computational Statistics》2020,12(4)

Because of its many practical applications, classifying functional data has received considerable attention over the last decades. Most classification approaches for functional data are extended from those for multivariate data. During the extension, two strategies, namely filtering and regularization, have commonly been employed to tackle the issues raised by the fact that functional data are intrinsically infinite‐dimensional. Because of space limitations, we focus on the filtering methods in this review. This article is categorized under:

Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data
Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

6.

Sandra E. Safo Qi Long 《Statistical Analysis and Data Mining》2019,12(2):56-69

Classification with high‐dimensional variables is a popular goal in many modern statistical studies. Fisher's linear discriminant analysis (LDA) is a common and effective tool for classifying entities into existing groups. It is well known that classification using Fisher's discriminant for high‐dimensional data is as bad as random guessing because of the use of many noise features, which increases the misclassification rate. It has recently been acknowledged that complex biological mechanisms occur through multiple features working together, though individually these features may contribute to noise accumulation in the data. In view of this, it is important to perform classification with discriminant vectors that use a subset of important variables, while also utilizing prior biological relationships among features. We tackle this problem in this paper and propose methods that incorporate variable selection into the classification problem for the identification of important biomarkers. Furthermore, we incorporate into the LDA problem prior information on the relationships among variables using undirected graphs in order to identify functionally meaningful biomarkers. We compare our methods with existing sparse LDA approaches via simulation studies and real data analysis.

7.

Classification of biomedical signals for differential diagnosis of Raynaud's phenomenon

Luigi Ippoliti Simone Di Zio Arcangelo Merla 《Journal of applied statistics》2014,41(8):1830-1847

This paper discusses a supervised classification approach for the differential diagnosis of Raynaud's phenomenon (RP). The classification of data from healthy subjects and from patients suffering from primary and secondary RP is obtained by means of a set of classifiers derived within the framework of linear discriminant analysis. A set of functional variables and shape measures extracted from rewarming/reperfusion curves are proposed as discriminant features. Since the prediction of group membership is based on a large number of these features, the high dimension/small sample size problem is considered to overcome the singularity problem of the within-group covariance matrix. Results on a data set of 72 subjects demonstrate that a satisfactory classification of the subjects can be achieved through the proposed methodology.

8.

Cai Li Luo Xiao 《Revue canadienne de statistique》2020,48(2):285-307

We study the design problem for the optimal classification of functional data. The goal is to select sampling time points so that functional data observed at these time points can be classified accurately. We propose optimal designs that are applicable to either dense or sparse functional data. Using linear discriminant analysis, we formulate our design objectives as explicit functions of the sampling points. We study the theoretical properties of the proposed design objectives and provide a practical implementation. The performance of the proposed design is evaluated through simulations and real data applications. The Canadian Journal of Statistics 48: 285–307; 2020 © 2019 Statistical Society of Canada

9.

《Journal of nonparametric statistics》2012,24(1):165-183

This paper gives a theoretical analysis of high-dimensional linear discrimination of Gaussian data. We study the excess risk of linear discriminant rules. We emphasize the poor performance of standard procedures in the case when dimension p is larger than sample size n. The corresponding theoretical results are non-asymptotic lower bounds. On the other hand, we propose two discrimination procedures based on dimensionality reduction and provide associated rates of convergence which can be O(log(p)/n) under sparsity assumptions. Finally, all our results rely on a theorem that provides simple sharp relations between the excess risk and an estimation error associated with the geometric parameters defining the used discrimination rule.

10.

Ruiyan Luo Xin Qi 《Scandinavian Journal of Statistics》2017,44(3):598-616

Many sparse linear discriminant analysis (LDA) methods have been proposed to overcome the major problems of the classic LDA in high‐dimensional settings. However, the asymptotic optimality results are limited to the case with only two classes. When there are more than two classes, the classification boundary is complicated and no explicit formulas for the classification errors exist. We consider the asymptotic optimality in the high‐dimensional settings for a large family of linear classification rules with an arbitrary number of classes. Our main theorem provides easy‐to‐check criteria for the asymptotic optimality of a general classification rule in this family as dimensionality and sample size both go to infinity and the number of classes is arbitrary. We establish the corresponding convergence rates. The general theory is applied to the classic LDA and the extensions of two recently proposed sparse LDA methods to obtain the asymptotic optimality.

11.

DIRECTION AND COLLINEARITY FACTORS OF WILKS' Λ ASSOCIATED WITH THE GROWTH CURVE MODEL

Anant M. Kshirsagar Thomas M. Davis 《Australian & New Zealand Journal of Statistics》1983,25(3):467-481

Khatri (1966) has derived a Wilks' Λ test of a general linear hypothesis in the growth curve model. In this paper we give the direction and collinearity factors and their null distributions when the hypothesis is not true but the noncentrality matrix is of rank one. Interpretation of these tests and their usefulness in discrimination in growth curve models are discussed.

12.

Canonical correlation analysis in the definition of weight restrictions for data envelopment analysis

Antonio Carlos Gonçalves Renan M.V.R. Almeida Marcos Pereira Estellita Lins 《Journal of applied statistics》2013,40(5):1032-1043

This work investigates the use of canonical correlation analysis (CCA) in the definition of weight restrictions for data envelopment analysis (DEA). With this purpose, CCA limits are introduced into Wong and Beasley's DEA model. An application of the method is made over data from hospitals in 27 Brazilian cities, producing as outputs average payment (average admission values) and percentage of hospital admissions according to disease groups (International Classification of Diseases, 9th Edition), and having as inputs mortality rates and average stay (length of stay after admission (days)). In this application, performance scores were calculated for both the (CCA) restricted and unrestricted DEA models. It can be concluded that the use of CCA-based weight limits for DEA models increases the consistency of the estimated DEA scores (more homogenous weights) and that these limits do not present mathematical infeasibility problems while avoiding the need for subjectively restricting weight variation in DEA.

13.

Quantification of symmetry for functional data with application to equine lameness classification

Helle Sørensen Anders Tolver Maj Halling Thomsen Pia Haubro Andersen 《Journal of applied statistics》2012,39(2):337-360

This paper presents a study on symmetry of repeated bi-phased data signals, in particular, on quantification of the deviation between the two parts of the signal. Three symmetry scores are defined using functional data techniques such as smoothing and registration. One score is related to the L₂-distance between the two parts of the signal, whereas the other two are constructed to specifically measure differences in amplitude and phase. Moreover, symmetry scores based on functional principal component analysis (PCA) are examined. The scores are applied to acceleration signals from a study on equine gait. The scores turn out to be highly associated with lameness, and their applicability for lameness quantification and detection is investigated. Four classification approaches turn out to give similar results. The scores describing amplitude and phase variation turn out to outperform the PCA scores when it comes to the classification of lameness.

14.

Yufei Wu Guan Yu 《Statistical Analysis and Data Mining》2020,13(5):437-450

Linear discriminant analysis (LDA) is widely used for various binary classification problems. In contrast to the LDA that estimates the precision matrix Ω and the mean difference vector δ in the classification rule separately, the linear programming discriminant (LPD) rule estimates the product Ωδ directly through a constrained ℓ₁ minimization. The LPD rule has very good classification performance on many high‐dimensional binary classification problems. However, to estimate β* = Ωδ, the LPD rule uses equal weights for all the elements of β* in the constrained ℓ₁ minimization. It may not deliver the optimal estimate of β*, and therefore the estimated discriminant direction can be suboptimal. In order to obtain better estimates of β* and the discriminant direction, we can heavily penalize β_j in the constrained ℓ₁ minimization if we suspect the jth feature is useless for the classification while moderately penalizing β_j if we suspect the jth feature is useful. In this paper, based on the LPD rule and some popular feature screening methods, we propose a new weighted linear programming discriminant (WLPD) rule for the high‐dimensional binary classification problem. The screening statistics used in the marginal two‐sample t‐test screening, Kolmogorov–Smirnov filter, and the maximum marginal likelihood screening will be used to construct appropriate weights for different elements of β* flexibly. Besides the linear programming algorithm, we develop a new alternating direction method of multipliers algorithm to solve the high‐dimensional constrained ℓ₁ minimization problem efficiently. Our numerical studies show that our proposed WLPD rule can outperform LPD and serve as an effective binary classification tool.
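The abstract above names a constrained ℓ₁ minimization without writing it out. One plausible reading, sketched here as an assumption and not taken from the paper, uses a sample covariance estimate \hat{\Sigma}, a sample mean difference \hat{\delta}, a tuning parameter \lambda, and nonnegative weights w_j (all of this notation is introduced for illustration):

```latex
% Assumed form of the equal-weight (LPD-type) problem
\hat{\beta}_{\mathrm{LPD}} \in \arg\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{p} |\beta_j|
\quad \text{subject to} \quad
\bigl\| \hat{\Sigma}\beta - \hat{\delta} \bigr\|_{\infty} \le \lambda

% Assumed weighted variant: a large w_j penalizes feature j heavily
\hat{\beta}_{\mathrm{WLPD}} \in \arg\min_{\beta \in \mathbb{R}^p} \sum_{j=1}^{p} w_j |\beta_j|
\quad \text{subject to} \quad
\bigl\| \hat{\Sigma}\beta - \hat{\delta} \bigr\|_{\infty} \le \lambda
```

Under this reading, setting every w_j = 1 recovers the equal-weight problem, and both problems can be rewritten as linear programs, which fits the rule's name.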

15.

A Variable Selection Criterion for Two Sets of Principal Component Scores in Principal Canonical Correlation Analysis

Toru Ogura Yasunori Fujikoshi Takakazu Sugiyama 《Communications in Statistics - Theory and Methods》2013,42(12):2118-2135

Canonical correlation analysis (CCA) is often used to analyze the correlation between two random vectors. However, sometimes interpretation of CCA results may be hard. In an attempt to address these difficulties, principal canonical correlation analysis (PCCA) was proposed. PCCA is CCA between two sets of principal component (PC) scores. We consider the problem of selecting useful PC scores in CCA. A variable selection criterion for one set of PC scores has been proposed by Ogura (2010); here, we propose a variable selection criterion for two sets of PC scores in PCCA. Furthermore, we demonstrate the effectiveness of this criterion.

16.

THE REDUCED-RANK GROWTH CURVE MODEL FOR DISCRIMINANT ANALYSIS OF LONGITUDINAL DATA

Jeffrey M. Albert Anant M. Kshirsagar 《Australian & New Zealand Journal of Statistics》1993,35(3):345-357

This paper presents a method of discriminant analysis especially suited to longitudinal data. The approach is in the spirit of canonical variate analysis (CVA) and is similarly intended to reduce the dimensionality of multivariate data while retaining information about group differences. A drawback of CVA is that it does not take advantage of special structures that may be anticipated in certain types of data. For longitudinal data, it is often appropriate to specify a growth curve structure (as given, for example, in the model of Potthoff & Roy, 1964). The present paper focuses on this growth curve structure, utilizing it in a model-based approach to discriminant analysis. For this purpose the paper presents an extension of the reduced-rank regression model, referred to as the reduced-rank growth curve (RRGC) model. It estimates discriminant functions via maximum likelihood and gives a procedure for determining dimensionality. This methodology is exploratory only, and is illustrated by a well-known dataset from Grizzle & Allen (1969).

17.

High-dimensional Markowitz portfolio optimization problem: empirical comparison of covariance matrix estimators

Young-Geun Choi Sujung Choi 《Journal of Statistical Computation and Simulation》2019,89(7):1278-1300

We compare the performance of recently developed regularized covariance matrix estimators for Markowitz's portfolio optimization and of the minimum variance portfolio (MVP) problem in particular. We focus on seven estimators that are applied to the MVP problem in the literature; three regularize the eigenvalues of the sample covariance matrix, and the other four assume the sparsity of the true covariance matrix or its inverse. Comparisons are made with two sets of long-term S&P 500 stock return data that represent two extreme scenarios of active and passive management. The results show that the MVPs with sparse covariance estimators have high Sharpe ratios but that the naive diversification (also known as the ‘uniform (on market share) portfolio’) still performs well in terms of wealth growth.
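As a rough illustration of how a regularized covariance estimate feeds into the MVP problem described above, the sketch below computes the unconstrained global minimum variance weights, w = Σ⁻¹1 / (1ᵀΣ⁻¹1), after a simple linear shrinkage of the sample covariance toward a scaled identity. The shrinkage target, the fixed intensity `alpha`, and the function names are assumptions for illustration; the abstract does not say which seven estimators the paper compares.

```python
import numpy as np


def linear_shrinkage(S, alpha):
    """Blend S with a scaled identity target.

    This pulls every eigenvalue of S toward the average eigenvalue.
    alpha in [0, 1] is an assumed, hand-picked intensity, not an
    estimated one.
    """
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


def min_variance_weights(cov):
    """Global minimum variance weights, allowing short positions."""
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)  # solves cov @ x = 1
    return x / x.sum()              # normalize so the weights sum to one


# Two uncorrelated assets with variances 1 and 4.
S_demo = np.diag([1.0, 4.0])
w_raw = min_variance_weights(S_demo)
w_shrunk = min_variance_weights(linear_shrinkage(S_demo, alpha=0.5))
```

In this toy case the unshrunk weights are 0.8 and 0.2. Shrinkage moves the two variances closer together, so the weights move toward the equal-weight portfolio that the abstract calls naive diversification.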

18.

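A Comparison of Random Forests and Bagging Classification Trees for Classification (total citations: 5, self-citations: 0, citations by others: 5)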

马景义 谢邦昌 《统计与信息论坛》2010,25(10):18-22

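Using experimental data, this paper explains from two theoretical perspectives why the random forest algorithm outperforms the bagging classification tree algorithm. Expressing both algorithms within two different frameworks removes some ambiguities in their analysis. In the second framework in particular, it becomes clear that the random forest algorithm outperforms bagging classification trees because it corresponds to a smaller bias.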

19.

Understanding Canonical Correlation through the General Linear Model and Principal Components

Keith E. Muller 《The American statistician》2013,67(4):342-354

Canonical correlation has been little used and little understood, even by otherwise sophisticated analysts. An alternative approach to canonical correlation, based on a general linear multivariate model, is presented. Properties of principal component analysis are used to help explain the method. Standard computational methods for full rank canonical correlation, techniques for canonical correlation on component scores, and canonical correlation with less than full rank are discussed. They are seen to be essentially equivalent when the model equation for canonical correlation on component scores is presented. The two approaches to less than full rank situations are equivalent in some senses, but quite different in usefulness, depending on the application. An example dataset is analyzed in detail to help demonstrate the conclusions.

20.

Philippe Casin 《Journal of applied statistics》2018,45(8):1396-1409

Credit scoring techniques have been developed in recent years to reduce the risk that banks and financial institutions take on the loans they grant. Credit scoring is the problem of classifying individuals into one of two groups: defaulting borrowers or non-defaulting borrowers. The aim of this paper is to propose a new method of discrimination for the case where the dependent variable is categorical and a large number of categorical explanatory variables are retained. This method, Categorical Multiblock Linear Discriminant Analysis, computes components which take into account both relationships between explanatory categorical variables and canonical correlation between each explanatory categorical variable and the dependent variable. A comparison with three other techniques and an application on credit scoring data are provided.