Similar Articles
 Found 20 similar articles (search time: 78 ms)
1.
Probabilistic Principal Component Analysis
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based on a probability model. We demonstrate how the principal axes of a set of observed data vectors may be determined through maximum likelihood estimation of parameters in a latent variable model that is closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss, with illustrative examples, the advantages conveyed by this probabilistic approach to PCA.
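The EM iteration described above can be sketched in a few lines (a minimal illustration of the Tipping–Bishop updates, not the authors' code; the function name and defaults are mine):

```python
import numpy as np

def ppca_em(X, q, n_iter=200, seed=0):
    """EM iterations for probabilistic PCA: fit a d x q loading matrix W and an
    isotropic noise variance sigma2 by maximum likelihood."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)            # centre the data
    S = Xc.T @ Xc / n                  # sample covariance
    W = rng.standard_normal((d, q))    # random initial loadings
    sigma2 = 1.0
    for _ in range(n_iter):
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
        SW = S @ W
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(q) + M_inv @ W.T @ SW)
        sigma2 = np.trace(S - SW @ M_inv @ W_new.T) / d
        W = W_new
    return W, sigma2
```

The columns of the recovered W span the principal subspace, while sigma2 estimates the variance lost by discarding the remaining components.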

2.
A crucial issue for principal components analysis (PCA) is to determine the number of principal components to capture the variability of usually high-dimensional data. In this article the dimension detection for PCA is formulated as a variable selection problem for regressions. The adaptive LASSO is used for the variable selection in this application. Simulations demonstrate that this approach is more accurate than existing methods in some cases and competitive in some others. The performance of this model is also illustrated using a real example.

3.
We develop a new principal component analysis (PCA) type dimension-reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem whose criterion function is motivated by a penalized Bernoulli likelihood. A majorization–minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.

4.
We present a Bayesian model selection approach to estimating the intrinsic dimensionality of a high-dimensional dataset. To this end, we introduce a novel formulation of the probabilistic principal component analysis model based on a normal-gamma prior distribution. In this context, we exhibit a closed-form expression for the marginal likelihood which makes it possible to infer an optimal number of components. We also propose a heuristic, based on the expected shape of the marginal likelihood curve, for choosing the hyperparameters. In nonasymptotic frameworks, we show on simulated data that this exact dimensionality selection approach is competitive with both Bayesian and frequentist state-of-the-art methods.

5.
Principal component regression (PCR) has two steps: estimating the principal components and performing the regression using these components. These steps are generally performed sequentially. In PCR, a crucial issue is the selection of the principal components to be included in the regression. In this paper, we build a hierarchical probabilistic PCR model with a dynamic component selection procedure. A latent variable is introduced to select promising subsets of components based upon the significance of the relationship between the response variable and the principal components in the regression step. We illustrate this model using real and simulated examples. The simulations demonstrate that our approach outperforms some existing methods in terms of root mean squared error of the regression coefficient.
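The two sequential PCR steps can be sketched as follows (a plain baseline with a fixed number of components k, not the paper's hierarchical model with dynamic selection; names are mine):

```python
import numpy as np

def pcr(X, y, k):
    """Principal component regression: PCA on the centred predictors, then
    least squares of the centred response on the first k component scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    T = Xc @ Vt[:k].T                                  # scores on first k PCs
    gamma = np.linalg.lstsq(T, y - y_mean, rcond=None)[0]
    beta = Vt[:k].T @ gamma                            # coefficients on original scale
    intercept = y_mean - x_mean @ beta
    return beta, intercept
```

With k equal to the number of predictors, this reduces to ordinary least squares; the interesting cases are small k, where the choice of which components to keep is exactly the issue the abstract addresses.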

6.
High-dimensional data often exhibit multi-collinearity, leading to unstable regression coefficients. To address sample selection bias and problems associated with high dimensionality, principal components were extracted and used as predictors in a switching regression model. Since principal component regression often suffers a decline in predictive ability due to the selection of only a few principal components, we formulate the model with a nonparametric function of the principal components in lieu of individual predictors. Simulation studies indicated better predictive ability for nonparametric principal component switching regression over its parametric counterpart, while mitigating the adverse effects of multi-collinearity and high dimensionality.

7.
Medical images and genetic assays typically generate data with more variables than subjects. Scientists may use a two-step approach for testing hypotheses about Gaussian mean vectors. In the first step, principal components analysis (PCA) selects a set of sample components fewer in number than the sample size. In the second step, applying classical multivariate analysis of variance (MANOVA) methods to the reduced set of variables provides the desired hypothesis tests. Simulation results presented here indicate that success of the PCA in the first step requires nearly all variation to occur in population components far fewer in number than the number of subjects. In the second step, multivariate tests fail to attain reasonable power except in restrictive, favorable cases. The results encourage using other approaches discussed in the article to provide dependable hypothesis testing with high dimension, low sample size data (HDLSS).

8.
Principal component analysis (PCA) is a popular technique for dimensionality reduction, but it is affected by the presence of outliers. The outlier sensitivity of classical PCA (CPCA) has motivated the development of new approaches. The effects of replacing outliers with estimates obtained by expectation–maximization (EM) and by multiple imputation (MI) were examined on an artificial and a real data set. Furthermore, robust PCA based on the minimum covariance determinant (MCD), PCA with outliers replaced by EM estimates, and PCA with outliers replaced by MI estimates were compared with the results of CPCA. In this study, we show the effects of replacing outliers with MI and EM estimates, depending on the ratio of outliers in the data set. Finally, when the ratio of outliers exceeds 20%, we suggest replacing outliers with estimates obtained by MI and EM as an alternative approach.

9.
In this paper, we propose a novel robust principal component analysis (PCA) for high-dimensional data in the presence of various heterogeneities, in particular heavy tails and outliers. A transformation motivated by the characteristic function is constructed to improve the robustness of classical PCA. The suggested method has the distinct advantage of handling heavy-tail-distributed data, whose covariances may be non-existent (positively infinite, for instance), in addition to the usual outliers. The proposed approach is also a special case of kernel principal component analysis (KPCA) and attains its robustness and non-linearity via a bounded, non-linear kernel function. The merits of the new method are illustrated by some statistical properties, including an upper bound on the excess error and the behaviour of the large eigenvalues under a spiked covariance model. Additionally, using a variety of simulations, we demonstrate the benefits of our approach over classical PCA. Finally, using data on protein expression in mice of various genotypes from a biological study, we apply the novel robust PCA to categorise the mice and find that our approach is more effective at identifying abnormal mice than classical PCA.

10.
The broken-stick (BS) method is a popular stopping rule in ecology for determining the number of meaningful components in principal component analysis. However, its properties have not been systematically investigated. The purpose of the current study is to evaluate its ability to detect the correct dimensionality in a data set and whether it tends to over- or underestimate it. A Monte Carlo protocol was carried out. Two main correlation matrices deemed usual in practice were used with three levels of correlation (0, 0.10 and 0.30) between components (generating oblique structures) and with different sample sizes. Analyses of the population correlation matrices indicated that, for extremely large sample sizes, the BS method could be correct for only one of the six simulated structures. It actually failed to identify the correct dimensionality half the time with orthogonal structures and did even worse with some oblique ones. Under harder conditions, the results show that the power of the BS method decreases as sample size increases, weakening its usefulness in practice. Since the BS method seems unlikely to identify the underlying dimensionality of the data, and given that better stopping rules exist, it appears to be a poor choice when carrying out principal component analysis.
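For reference, the BS rule itself is straightforward to implement (a minimal sketch; the function name is mine):

```python
import numpy as np

def broken_stick_components(eigvals):
    """Broken-stick rule: retain component k while its share of total variance
    exceeds the expected k-th longest piece of a unit stick broken at random
    into p pieces, b_k = (1/p) * sum_{i=k}^{p} 1/i."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = len(vals)
    props = vals / vals.sum()
    keep = 0
    for k in range(1, p + 1):
        b_k = np.sum(1.0 / np.arange(k, p + 1)) / p
        if props[k - 1] > b_k:
            keep += 1
        else:
            break
    return keep
```

For eigenvalues [5, 1, 1, 1, 1, 1] the rule keeps one component, since the second component's share (0.10) falls below its broken-stick expectation.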

11.
In classical principal component analysis (PCA), the empirical influence function for the sensitivity coefficient ρ is used to detect observations that are influential on the subspace spanned by the dominant principal components. In this article, we derive the influence function of ρ in the case where the reweighted minimum covariance determinant (MCD1) is used as the estimator of multivariate location and scatter. Our aim is to confirm the reliability, in terms of robustness, of MCD1 via the approach based on the influence function of the sensitivity coefficient.

12.
Canonical correlations are maximized correlation coefficients indicating the relationships between pairs of canonical variates that are linear combinations of the two sets of original variables. The number of non-zero canonical correlations in a population is called its dimensionality. Parallel analysis (PA) is an empirical method for determining the number of principal components or factors that should be retained in factor analysis. An example is given to illustrate how the proposed procedures, based on PA and bootstrap-modified PA, can be adapted to the context of canonical correlation analysis (CCA). The performance of the proposed procedures is evaluated in a simulation study by comparison with traditional sequential test procedures with respect to under-, correct- and over-determination of dimensionality in CCA.
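In its original PCA setting, PA can be sketched as follows (a generic illustration, not the proposed CCA adaptation; names and defaults are mine):

```python
import numpy as np

def parallel_analysis(X, n_sims=200, quantile=0.95, seed=0):
    """Horn's parallel analysis: keep leading eigenvalues of the sample
    correlation matrix that exceed the chosen quantile of eigenvalues obtained
    from independent normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    null = np.empty((n_sims, p))
    for b in range(n_sims):
        R = rng.standard_normal((n, p))
        null[b] = np.sort(np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]
    thresh = np.quantile(null, quantile, axis=0)
    keep = 0
    for o, t in zip(obs, thresh):
        if o > t:
            keep += 1
        else:
            break
    return keep
```

The CCA adaptation replaces the eigenvalues of the correlation matrix with sample canonical correlations and the normal reference data with an appropriate null for two variable sets.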

13.
The aims of this study were to undertake principal component analysis (PCA) of hip dysplasia (HD) and to examine the power of the principal components (PCs) in genome-wide association studies. Cohorts of 278 dogs (for PCA) and 369 dogs (for genotyping) were used. The distraction index (DI), the dorsolateral subluxation (DLS) score, the Norberg angle (NA), and the extended-hip radiographic (EHR) score were used for the PCA. One thousand single-nucleotide polymorphisms (SNPs) (of 23,500) were used to simulate genetic locus sharing between the HD phenotypes, and 1000 SNPs were used to calculate the genetic mapping power of the PCs. The DI and the DLS score (first group) reflected hip laxity, and the NA and the EHR score (second group) reflected the congruency between the femoral head and the acetabulum. The average hip measurements of the two groups, reflected in the first PC, captured 55% of the total radiographic variation. The first four PCs captured 90% of the total variation. The PCs had higher statistical mapping power to detect pleiotropic quantitative trait loci (QTL) than the raw phenotypes. The PCA demonstrated for the first time that HD can be reduced mathematically into simpler components essential for its genetic dissection. Genes that contribute jointly to all four radiographic hip phenotypes can be detected by mapping their first four PCs, while those contributing to individual phenotypes can be mapped by association with the individual raw phenotype.

14.
In this paper, a new method for robust principal component analysis (PCA) is proposed. PCA is a widely used tool for dimension reduction without substantial loss of information. However, classical PCA is vulnerable to outliers due to its dependence on the empirical covariance matrix. To avoid this weakness, several alternative approaches based on robust scatter matrices have been suggested. A popular choice is ROBPCA, which combines projection pursuit ideas with robust covariance estimation via a variance maximization criterion. Our approach is based on the fact that PCA can be formulated as a regression-type optimization problem, which is the main difference from the previous approaches. The proposed robust PCA is derived by replacing the squared loss with a robust alternative, the Huber loss function. A practical algorithm is proposed to carry out the optimization, and the convergence properties of the algorithm are investigated. Results from a simulation study and a real data example demonstrate the promising empirical properties of the proposed method.
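As a loose illustration of the regression-type idea (not the authors' algorithm; all names, constants and the reweighting scheme here are my own), one can alternate a weighted low-rank fit with Huber-type downweighting of rows with large residuals:

```python
import numpy as np

def huber_weight(r, c=1.345):
    """Huber weights: 1 for small residuals, c/|r| beyond the cutoff c."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def robust_pca_huber(X, q, n_iter=50):
    """Illustrative iteratively reweighted PCA: alternate between a weighted
    rank-q subspace fit and Huber-type downweighting of outlying rows."""
    mu = np.median(X, axis=0)              # robust initial centre
    w = np.ones(len(X))
    for _ in range(n_iter):
        Xc = X - mu
        Vt = np.linalg.svd(np.sqrt(w)[:, None] * Xc, full_matrices=False)[2]
        V = Vt[:q].T                       # current loading estimate (d x q)
        resid = np.linalg.norm(Xc - Xc @ V @ V.T, axis=1)
        scale = np.median(resid) / 0.6745 + 1e-12   # robust residual scale
        w = huber_weight(resid / scale)
        mu = np.average(X, axis=0, weights=w)       # reweighted centre
    return V, mu
```

Rows far from the fitted subspace receive small weights and thus contribute little to the next subspace estimate, which is the same intuition as replacing the squared loss with the Huber loss.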

15.
This paper presents a novel ensemble classifier generation method by integrating the ideas of bootstrap aggregation and Principal Component Analysis (PCA). To create each individual member of an ensemble classifier, PCA is applied to every out-of-bag sample and the computed coefficients of all principal components are stored, and then the principal components calculated on the corresponding bootstrap sample are taken as additional elements of the original feature set. A classifier is trained with the bootstrap sample and some features randomly selected from the new feature set. The final ensemble classifier is constructed by majority voting of the trained base classifiers. The results obtained by empirical experiments and statistical tests demonstrate that the proposed method performs better than or as well as several other ensemble methods on some benchmark data sets publicly available from the UCI repository. Furthermore, the diversity-accuracy patterns of the ensemble classifiers are investigated by kappa-error diagrams.

16.
Principal component analysis (PCA) and functional principal component analysis are key tools in multivariate analysis, in particular for modelling yield curves, but little attention is given to questions of uncertainty, either in the components themselves or in derived quantities such as scores. Actuaries using PCA to model yield curves to assess interest rate risk for insurance companies are required to show any uncertainty in their calculations. Asymptotic results based on assumptions of multivariate normality are unsatisfactory for modest samples, and the application of bootstrap methods is not straightforward, with the novel pitfalls of possible inversions in the order of sample components and reversals of signs. We present methods for overcoming these difficulties and discuss other potential hazards that arise.
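The component-reordering and sign-reversal pitfalls can be handled by aligning each bootstrap replicate to the full-sample loadings (a minimal sketch; the alignment-by-maximal-correlation rule is one simple choice, not necessarily the authors' method):

```python
import numpy as np

def bootstrap_pc_loadings(X, comp=0, n_boot=200, seed=0):
    """Bootstrap one component's loading vector, aligning every replicate to the
    full-sample loadings to guard against component reordering and sign flips."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ref = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2]
    out = np.empty((n_boot, p))
    for b in range(n_boot):
        Xb = X[rng.integers(0, n, n)]                     # resample rows
        Vt = np.linalg.svd(Xb - Xb.mean(axis=0), full_matrices=False)[2]
        j = np.argmax(np.abs(Vt @ ref[comp]))             # best-matching component
        v = Vt[j] if Vt[j] @ ref[comp] >= 0 else -Vt[j]   # fix sign reversal
        out[b] = v
    return out   # n_boot aligned loading vectors; summarise e.g. by percentiles
```

Percentiles of the aligned rows then give bootstrap uncertainty bands for the loadings that are not corrupted by arbitrary sign or order changes across replicates.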

17.
This work is concerned with robustness in principal component analysis (PCA). The approach we adopt here is to replace the least squares criterion by another criterion based on a convex and sufficiently differentiable loss function ρ. Using this criterion we propose a robust estimate of the location vector and introduce an orthogonality with respect to (w.r.t.) ρ in order to define the different steps of a PCA. The influence functions of the mean vector and the principal vectors are developed in order to provide a method for obtaining a robust PCA. The practical procedure is based on an alternating-steps algorithm.

18.
In principal component analysis (PCA), it is crucial to know how many principal components (PCs) should be retained in order to account for most of the data variability. A class of “objective” rules for finding this quantity is the class of cross-validation (CV) methods. In this work we compare three CV techniques showing how the performance of these methods depends on the covariance matrix structure. Finally we propose a rule for the choice of the “best” CV method and give an application to real data.

19.
The effect of nonstationarity in the time series columns of input data on principal component analysis is examined. Nonstationarity is very common among economic indicators collected over time, which are subsequently summarized into fewer indices for purposes of monitoring. Due to the simultaneous drifting of nonstationary time series, usually caused by trend, the first component averages all the variables without necessarily reducing dimensionality. Sparse principal component analysis can be used, but attainment of sparsity among the loadings (and hence dimension reduction) is influenced by the choice of the parameters λ1,j. Simulated data with more variables than observations and with different patterns of cross-correlations and autocorrelations were used to illustrate the advantages of sparse principal component analysis over ordinary principal component analysis. Sparse component loadings for nonstationary time series data can be achieved provided that appropriate values of λ1,j are used. We provide the range of values of λ1,j that will ensure convergence of the sparse principal components algorithm and consequently achieve sparsity of the component loadings.

20.
In this article, we consider clustering based on principal component analysis (PCA) for high-dimensional mixture models. We present theoretical reasons why PCA is effective for clustering high-dimensional data. First, we derive a geometric representation of high-dimension, low-sample-size (HDLSS) data taken from a two-class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We develop ideas of the geometric representation and provide geometric consistency properties for multiclass mixture models. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. Finally, we demonstrate the performance of the clustering using gene expression datasets.
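The simplest form of the idea, splitting a two-class HDLSS sample by the sign of the first PC score, can be sketched as follows (an illustrative baseline, not the paper's full procedure; the function name is mine):

```python
import numpy as np

def pca_cluster_two_class(X):
    """Cluster rows of an HDLSS data matrix into two groups by the sign of
    their first principal component score."""
    Xc = X - X.mean(axis=0)
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    scores = Xc @ Vt[0]          # projections onto the first PC direction
    return (scores > 0).astype(int)
```

When the between-class mean difference dominates the within-class variation, the first PC aligns with that difference, so the sign of the score separates the two classes (up to an arbitrary label swap).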


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司), ICP licence 京ICP备09084417号