Similar documents
20 similar documents found (search time: 31 ms)
1.
Principal component analysis is a popular dimension reduction technique often used to visualize high-dimensional data structures. In genomics, this can involve millions of variables, but only tens to hundreds of observations. Theoretically, such extreme high dimensionality will cause biased or inconsistent eigenvector estimates, but in practice, the principal component scores are used for visualization with great success. In this paper, we explore when and why the classical principal component scores can be used to visualize structures in high-dimensional data, even when there are few observations compared with the number of variables. Our argument is twofold: First, we argue that eigenvectors related to pervasive signals will have eigenvalues scaling linearly with the number of variables. Second, we prove that for linearly increasing eigenvalues, the sample component scores will be scaled and rotated versions of the population scores, asymptotically. Thus, the visual information of the sample scores will be unchanged, even though the sample eigenvectors are biased. In the case of pervasive signals, the principal component scores can be used to visualize the population structures, even in extreme high-dimensional situations.
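A minimal simulation sketch of the phenomenon described above (Python with NumPy; the synthetic data and group labels are assumptions, not the paper's setup): with a pervasive signal, the leading sample principal component scores still separate the population groups even when the number of variables far exceeds the number of observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20000                                        # few observations, many variables
groups = np.repeat([0, 1], n // 2)                      # two hypothetical populations
signal = np.outer(2 * groups - 1, rng.normal(size=p))   # pervasive: touches every variable
X = signal + rng.normal(size=(n, p))                    # signal plus noise

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)       # PCA via SVD of the centred data
scores = U[:, :2] * s[:2]                               # sample principal component scores

# The first score separates the two groups despite the biased eigenvectors.
print(abs(np.corrcoef(scores[:, 0], groups)[0, 1]))
```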

2.
Most linear statistical methods deal with data lying in a Euclidean space. However, there are many examples, such as DNA molecule topological structures, in which the initial or the transformed data lie in a non-Euclidean space. To get a measure of variability in these situations, principal component analysis (PCA) is usually performed on a Euclidean tangent space, as it cannot be directly implemented on a non-Euclidean space. By contrast, principal geodesic analysis (PGA) is a newer tool that provides a measure of variability for nonlinear statistics. In this paper, the performance of this new tool is compared with that of PCA using a real data set representing a DNA molecular structure. It is shown that, due to the nonlinearity of the space, PGA explains more variability of the data than PCA.

3.
We investigate the effect of measurement error on principal component analysis in the high-dimensional setting. The effects of random, additive errors are characterized by the expectation and variance of the changes in the eigenvalues and eigenvectors. The results show that the impact of uncorrelated measurement error on the principal component scores is mainly in terms of increased variability and not bias. In practice, the error-induced increase in variability is small compared with the original variability for the components corresponding to the largest eigenvalues. This suggests that the impact will be negligible when these component scores are used in classification and regression or for visualizing data. However, the measurement error will contribute to a large variability in component loadings, relative to the loading values, such that interpretation based on the loadings can be difficult. The results are illustrated by simulating additive Gaussian measurement error in microarray expression data from cancer tumours and control tissues.
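A small illustrative sketch of the setting (simulated low-rank data, not the microarray data used in the paper): add Gaussian measurement error to the data matrix and compare the leading component scores and loadings with and without the error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n, p = 100, 2000
true_scores = rng.normal(size=(n, 2)) * [10.0, 5.0]        # two strong components
loadings = np.linalg.qr(rng.normal(size=(p, 2)))[0]        # orthonormal loading vectors
X = true_scores @ loadings.T + rng.normal(size=(n, p))     # "error-free" data
X_err = X + rng.normal(size=(n, p))                        # additive measurement error

pca = PCA(n_components=2).fit(X)
pca_err = PCA(n_components=2).fit(X_err)

# Agreement of the first component scores with and without measurement error ...
print(abs(np.corrcoef(pca.transform(X)[:, 0], pca_err.transform(X_err)[:, 0])[0, 1]))
# ... and agreement of the corresponding loading vectors.
print(abs(pca.components_[0] @ pca_err.components_[0]))
```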

4.
Tanaka (1988) derived two influence functions related to an ordinary eigenvalue problem (A − λ_s I)v_s = 0 of a real symmetric matrix A and used them for sensitivity analysis in principal component analysis. One of these influence functions was used to develop sensitivity analysis in factor analysis (see e.g. Tanaka and Odaka, 1988a). The present paper derives some additional influence functions related to the ordinary eigenvalue problem and also several influence functions related to a generalized eigenvalue problem (A − θ_s B)u_s = 0, where A and B are real symmetric and real symmetric positive definite matrices, respectively. These influence functions are applicable not only to the case where the eigenvalues of interest are all simple but also to the case where there are some multiple eigenvalues among those of interest.
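For readers wanting to experiment, a generalized symmetric-definite eigenproblem of the form (A − θ_s B)u_s = 0 can be solved numerically as below (a generic sketch with random matrices, unrelated to the influence functions derived in the paper).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = (M + M.T) / 2                      # real symmetric
N = rng.normal(size=(5, 5))
B = N @ N.T + 5 * np.eye(5)            # real symmetric positive definite

theta, U = eigh(A, B)                  # solves A u = theta B u
# Residual check: each column u_s satisfies (A - theta_s B) u_s ≈ 0.
print(np.max(np.abs(A @ U - B @ U * theta)))
```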

5.
The problem of component choice in regression-based prediction has a long history. The main cases where important choices must be made are functional data analysis, and problems in which the explanatory variables are relatively high-dimensional vectors. Indeed, principal component analysis has become the basis for methods for functional linear regression. In this context the number of components can also be interpreted as a smoothing parameter, and so the viewpoint is a little different from that for standard linear regression. However, arguments for and against conventional component choice methods are relevant to both settings and have received significant recent attention. We give a theoretical argument, which is applicable in a wide variety of settings, justifying the conventional approach. Although our result is of minimax type, it is not asymptotic in nature; it holds for each sample size. Motivated by the insight that is gained from this analysis, we give theoretical and numerical justification for cross-validation choice of the number of components that is used for prediction. In particular we show that cross-validation leads to asymptotic minimization of mean summed squared error, in settings which include functional data analysis.
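A hedged sketch of the cross-validation recipe the paper justifies, applied to principal component regression on synthetic data (the component range, number of folds, and data-generating model are illustrative assumptions).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

cv_mse = []
for k in range(1, 16):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    mse = -cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    cv_mse.append(mse)

best_k = int(np.argmin(cv_mse)) + 1    # component number minimising CV prediction error
print(best_k)
```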

6.
The cross-validation of principal components is a problem that occurs in many applications of statistics. The naive approach of omitting each observation in turn and repeating the principal component calculations is computationally costly. In this paper we present an efficient approach to leave-one-out cross-validation of principal components. This approach exploits the regular nature of leave-one-out principal component eigenvalue downdating. We derive influence statistics and consider the application to principal component regression.
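For contrast with the efficient downdating approach described above, the following is the naive leave-one-out scheme it improves on (illustrative random data; refitting the eigendecomposition once per omitted case is exactly the cost the paper avoids).

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))

def top_eigenvalues(data, k=3):
    c = np.cov(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(c))[::-1][:k]

# Naive leave-one-out: recompute the leading eigenvalues with each case omitted.
loo_eigs = np.array([
    top_eigenvalues(np.delete(X, i, axis=0)) for i in range(X.shape[0])
])
# Influence-style summary: change in each eigenvalue when case i is dropped.
delta = loo_eigs - top_eigenvalues(X)
print(np.abs(delta).max(axis=0))
```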

7.
A number of results have been derived recently concerning the influence of individual observations in a principal component analysis. Some of these results, particularly those based on the correlation matrix, are applied to data consisting of seven anatomical measurements on students. The data have a correlation structure which is fairly typical of many found in allometry. This case study shows that theoretical influence functions often provide good estimates of the actual changes observed when individual observations are deleted from a principal component analysis. Different observations may be influential for different aspects of the principal component analysis (coefficients, variances and scores of principal components); these differences, and the distinction between outlying and influential observations are discussed in the context of the case study. A number of other complications, such as switching and rotation of principal components when an observation is deleted, are also illustrated.
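A rough empirical-influence sketch in the same spirit (hypothetical data standing in for the seven anatomical measurements): delete each observation in turn and record how far the first principal component of the correlation matrix rotates.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 7))            # stand-in for 7 anatomical measurements

def first_pc(data):
    r = np.corrcoef(data, rowvar=False)
    vals, vecs = np.linalg.eigh(r)
    v = vecs[:, -1]
    return v if v[0] >= 0 else -v       # fix the sign for comparability

v_full = first_pc(X)
angle_change = np.array([
    np.degrees(np.arccos(np.clip(abs(first_pc(np.delete(X, i, axis=0)) @ v_full), 0, 1)))
    for i in range(X.shape[0])
])
print(np.argsort(angle_change)[-5:])    # five most influential cases for the PC1 direction
```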

8.
In this study, classical and robust principal component analyses are used to evaluate the socioeconomic development of the regions served by development agencies, which were established to reduce development disparities among regions in Turkey. Because development levels differ greatly across regions, outliers arise, so robust statistical methods are used; classical and robust methods are also used to check whether there are any outliers in the data set. In classical principal component analysis the number of observations must be larger than the number of variables, otherwise the determinant of the covariance matrix is zero. ROBPCA, a robust approach to principal component analysis for high-dimensional data, yields principal components even when the number of variables is larger than the number of observations. In this paper, the 26 development agencies are first evaluated on 19 variables using principal component analysis based on classical and robust scatter matrices, and then on 46 variables using the ROBPCA method.
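A minimal sketch of PCA based on classical versus robust scatter matrices (simulated data with a few outliers; the MinCovDet estimator stands in for a robust scatter matrix and is not the ROBPCA algorithm used in the study).

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
X[:5] += 10                               # a few outlying observations

def pca_from_scatter(S):
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

classical_vals, _ = pca_from_scatter(np.cov(X, rowvar=False))
robust_vals, _ = pca_from_scatter(MinCovDet(random_state=0).fit(X).covariance_)

print(classical_vals)   # eigenvalues from the classical covariance, inflated by outliers
print(robust_vals)      # eigenvalues from the robust scatter estimate
```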

9.
This work is devoted to robust principal component analysis (PCA). We compare several multivariate estimators of location and scatter by computing the influence functions of the sensitivity coefficient ρ corresponding to these estimators, and the mean squared error (MSE) of the estimators of ρ. The coefficient ρ measures the closeness between the subspaces spanned by the initial eigenvectors and their corresponding versions derived from an infinitesimal perturbation of the data distribution.
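One illustrative way to quantify the closeness between the initial eigenvector subspace and its counterpart under a perturbed distribution is via principal angles, sketched below; the paper's coefficient ρ and its influence function are defined more formally.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(7)
X = rng.multivariate_normal(np.zeros(5), np.diag([5, 3, 1, 1, 1]), size=500)
X_pert = np.vstack([X, rng.normal(scale=10, size=(10, 5))])   # small contamination

def leading_subspace(data, k=2):
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, np.argsort(vals)[::-1][:k]]

angles = subspace_angles(leading_subspace(X), leading_subspace(X_pert))
closeness = np.prod(np.cos(angles))       # 1 means the two subspaces coincide
print(closeness)
```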

10.
To compare their performance on high-dimensional data, several regression methods are applied to data sets in which the number of explanatory variables greatly exceeds the sample sizes. The methods are stepwise regression, principal components regression, two forms of latent root regression, partial least squares, and a new method developed here. The data are four sample sets for which near infrared reflectance spectra have been determined, and the regression methods use the spectra to estimate the concentration of various chemical constituents, the latter having been determined by standard chemical analysis. Thirty-two regression equations are estimated using each method and their performances are evaluated using validation data sets. Although it is the most widely used, stepwise regression was decidedly poorer than the other methods considered. Differences between the latter were small, with partial least squares performing slightly better than the other methods under all criteria examined, albeit not by a statistically significant amount.
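An illustrative comparison in the same spirit (synthetic, spectra-like data rather than the near infrared data used in the study): fit principal components regression and partial least squares with the same number of components and compare validation error.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(8)
n, p = 120, 400                                   # more variables than samples
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X_tr, y_tr)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)

print(mean_squared_error(y_te, pcr.predict(X_te)))          # PCR validation MSE
print(mean_squared_error(y_te, pls.predict(X_te).ravel()))  # PLS validation MSE
```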

11.
The Gaussian rank correlation equals the usual correlation coefficient computed from the normal scores of the data. Although its influence function is unbounded, it still has attractive robustness properties. In particular, its breakdown point is above 12%. Moreover, the estimator is consistent and asymptotically efficient at the normal distribution. The correlation matrix obtained from pairwise Gaussian rank correlations is always positive semidefinite, and very easy to compute, also in high dimensions. We compare the properties of the Gaussian rank correlation with the popular Kendall and Spearman correlation measures. A simulation study confirms the good efficiency and robustness properties of the Gaussian rank correlation. In the empirical application, we show how it can be used for multivariate outlier detection based on robust principal component analysis.
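A minimal sketch of the Gaussian rank correlation itself: the ordinary Pearson correlation computed from the normal scores of the ranks (simulated data, with a few gross outliers added for illustration).

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_rank_corr(x, y):
    n = len(x)
    zx = norm.ppf(rankdata(x) / (n + 1))   # normal scores of the ranks
    zy = norm.ppf(rankdata(y) / (n + 1))
    return np.corrcoef(zx, zy)[0, 1]

rng = np.random.default_rng(9)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)
y[:3] += 50                                 # gross outliers barely move the estimate
print(gaussian_rank_corr(x, y))
```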

12.
High-content automated imaging platforms allow the multiplexing of several targets simultaneously to generate multi-parametric single-cell data sets over extended periods of time. Typically, standard simple measures such as the mean value over all cells at every time point are calculated to summarize the temporal process, resulting in a loss of the time dynamics of the single cells. Multiple experiments are performed, but observation time points are not necessarily identical, leading to difficulties when integrating summary measures from different experiments. We used functional data analysis to analyze continuous curve data, where the temporal process of a response variable for each single cell can be described using a smooth curve. This allows analyses to be performed on continuous functions, rather than on the original discrete data points. Functional regression models were applied to determine common temporal characteristics of a set of single-cell curves, and random effects were employed in the models to explain variation between experiments. The aim of the multiplexing approach is to simultaneously analyze the effect of a large number of compounds in comparison to control in order to discriminate between their modes of action. Functional principal component analysis based on T-statistic curves for pairwise comparison to control was used to study time-dependent compound effects.

13.
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. However, existing methods for computing such measures cannot be applied straightforwardly when the data contain missing values, even though there are workarounds such as complete case analysis and imputation. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and shows good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. It takes the occurrence of missing values into account, which also makes its results differ from those obtained under multiple imputation.

14.
We review and extend some statistical tools that have proved useful for analysing functional data. Functional data analysis is primarily designed for the analysis of random trajectories and infinite-dimensional data, and there exists a need for the development of adequate statistical estimation and inference techniques. While this field is in flux, some methods have proven useful. These include warping methods, functional principal component analysis, and conditioning under Gaussian assumptions for the case of sparse data. The latter is a recent development that may provide a bridge between functional and more classical longitudinal data analysis. Besides presenting a brief review of functional principal components and functional regression, we develop some concepts for estimating functional principal component scores in the sparse situation. An extension of the so-called generalized functional linear model to the case of sparse longitudinal predictors is proposed. This extension includes functional binary regression models for longitudinal data and is illustrated with data on primary biliary cirrhosis.
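A compact sketch of functional principal component analysis for densely and regularly observed curves (eigendecomposition of the discretised sample covariance on a common grid, with simulated curves); the conditional-expectation treatment of sparse data discussed above requires additional machinery not shown here.

```python
import numpy as np

rng = np.random.default_rng(10)
t = np.linspace(0, 1, 101)                      # common observation grid
n = 150
scores = rng.normal(size=(n, 2)) * [3.0, 1.0]
curves = (scores[:, [0]] * np.sin(2 * np.pi * t)
          + scores[:, [1]] * np.cos(2 * np.pi * t)
          + 0.2 * rng.normal(size=(n, t.size)))

mean_curve = curves.mean(axis=0)
centred = curves - mean_curve
cov = centred.T @ centred / n                                 # discretised covariance surface
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
eigenfunctions = vecs[:, order[:2]] / np.sqrt(t[1] - t[0])    # approximate L2 normalisation
fpc_scores = centred @ eigenfunctions * (t[1] - t[0])         # numerical inner products

# The first estimated score tracks the true first score (up to sign and scale).
print(np.abs(np.corrcoef(fpc_scores[:, 0], scores[:, 0])[0, 1]))
```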

15.
The essence of the generalised multivariate Behrens–Fisher problem (BFP) is how to test the null hypothesis of equality of mean vectors for two or more populations when their dispersion matrices differ. Solutions to the BFP usually assume variables are multivariate normal and do not handle high-dimensional data. In ecology, species' count data are often high-dimensional, non-normal and heterogeneous. Also, interest lies in analysing compositional dissimilarities among whole communities in non-Euclidean (semi-metric or non-metric) multivariate space. Hence, dissimilarity-based tests by permutation (e.g., PERMANOVA, ANOSIM) are used to detect differences among groups of multivariate samples. Such tests are not robust, however, to heterogeneity of dispersions in the space of the chosen dissimilarity measure, most conspicuously for unbalanced designs. Here, we propose a modification to the PERMANOVA test statistic, coupled with either permutation or bootstrap resampling methods, as a solution to the BFP for dissimilarity-based tests. Empirical simulations demonstrate that the type I error remains close to nominal significance levels under classical scenarios known to cause problems for the un-modified test. Furthermore, the permutation approach is found to be more powerful than the (more conservative) bootstrap for detecting changes in community structure for real ecological datasets. The utility of the approach is shown through analysis of 809 species of benthic soft-sediment invertebrates from 101 sites in five areas spanning 1960 km along the Norwegian continental shelf, based on the Jaccard dissimilarity measure.
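A sketch of the standard (un-modified) PERMANOVA-style pseudo-F test on a dissimilarity matrix with permutation of group labels, for a balanced one-way design on simulated count data; the modified statistic and bootstrap variant proposed in the paper are not reproduced here, and Bray-Curtis stands in for the Jaccard measure.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pseudo_F(D, labels):
    N = len(labels)
    groups = np.unique(labels)
    ss_total = (D ** 2).sum() / (2 * N)                       # from all pairwise distances
    ss_within = sum(
        (D[np.ix_(labels == g, labels == g)] ** 2).sum() / (2 * (labels == g).sum())
        for g in groups
    )
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (N - len(groups)))

rng = np.random.default_rng(11)
X = rng.poisson(2.0, size=(40, 30)).astype(float)   # stand-in for species count data
X[20:, :5] += 3                                      # a community shift in group 2
labels = np.repeat([0, 1], 20)
D = squareform(pdist(X, metric="braycurtis"))

obs = pseudo_F(D, labels)
perm = np.array([pseudo_F(D, rng.permutation(labels)) for _ in range(999)])
print((1 + (perm >= obs).sum()) / (1 + len(perm)))   # permutation p-value
```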

16.
Treating principal component analysis (PCA) and canonical variate analysis (CVA) as methods for approximating tables, we develop measures, collectively termed predictivity, that assess the quality of fit independently for each variable and for all dimensionalities. We illustrate their use with data from aircraft development, the African timber industry and copper froth measurements from the mining industry. Similar measures are described for assessing the predictivity associated with the individual samples (in the case of PCA and CVA) or group means (in the case of CVA). For these measures to be meaningful, certain essential orthogonality conditions must hold that are shown to be satisfied by predictivity.
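A sketch of an axis-predictivity-style summary for PCA (an assumed form for illustration, not necessarily the exact measure defined in the paper): the proportion of each variable's centred sum of squares reproduced by the rank-k approximation, reported for every dimensionality k.

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # correlated variables
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

total = (Xc ** 2).sum(axis=0)                    # per-variable sum of squares
for k in range(1, Xc.shape[1] + 1):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]             # rank-k reconstruction of the table
    predictivity = 1 - ((Xc - Xk) ** 2).sum(axis=0) / total
    print(k, np.round(predictivity, 2))          # one fit-quality value per variable
```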

17.
Mihyun Kim, Statistics, 2019, 53(4): 699-720
Functional principal component scores are commonly used to reduce infinite-dimensional functional data to finite-dimensional vectors. In certain applications, most notably in finance, these scores exhibit tail behaviour consistent with the assumption of regular variation. Knowledge of the index of regular variation, α, is needed to apply methods of extreme value theory. The most commonly used method for the estimation of α is the Hill estimator. We derive conditions under which the Hill estimator computed from the sample scores is consistent for the tail index of the unobservable population scores.
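A minimal Hill-estimator sketch (applied to simulated heavy-tailed values rather than to functional principal component scores; the sample size and the choice of k are arbitrary assumptions).

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimate of the tail index alpha from the k largest observations."""
    order = np.sort(np.abs(x))[::-1]
    logs = np.log(order[:k]) - np.log(order[k])
    return 1.0 / logs.mean()

rng = np.random.default_rng(13)
alpha = 3.0
scores = rng.pareto(alpha, size=2000) + 1.0      # classical Pareto tail with index alpha
print(hill_estimator(scores, k=200))              # should land near 3
```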

18.
This paper compares differences in insurance density across China's eastern, central and western regions over 2000-2006, applies principal component analysis to the factors that influence insurance density, and uses panel data models to run separate regressions for the eastern, central and western regions. The study shows that the main factors behind regional differences in insurance density include regional per capita GDP, per capita consumption, education level, urbanization, industrial structure, social welfare expenditure, sex ratio and age structure, and that both the influencing factors and the strength of their effects differ across regions. To narrow the regional differences in insurance density, policy measures should be tailored to the different regions.

19.
The case sensitivity function approach to influence analysis is introduced as a natural smooth extension of influence curve methodology in which both the insights of geometry and the power of (convex) analysis are available. In it, perturbation is defined as movement between probability vectors defining weighted empirical distributions. A Euclidean geometry is proposed giving such perturbations both size and direction. The notion of the salience of a perturbation is emphasized. This approach has several benefits. A general probability case weight analysis results. Answers to a number of outstanding questions follow directly. Rescaled versions of the three usual finite sample influence curve measures—seen now to be required for comparability across different-sized subsets of cases—are readily available. These new diagnostics directly measure the salience of the (infinitesimal) perturbations involved. Their essential unity, both within and between subsets, is evident geometrically. Finally it is shown how a relaxation strategy, in which a high-dimensional (O(nCm)) discrete problem is replaced by a low-dimensional (O(n)) continuous problem, can combine with (convex) optimization results to deliver better performance in challenging multiple-case influence problems. Further developments are briefly indicated.

20.
The analysis of high-dimensional data often begins with the identification of lower dimensional subspaces. Principal component analysis is a dimension reduction technique that identifies linear combinations of variables along which most variation occurs or which best "reconstruct" the original variables. For example, many temperature readings may be taken in a production process when in fact there are just a few underlying variables driving the process. A problem with principal components is that the linear combinations can seem quite arbitrary. To make them more interpretable, we introduce two classes of constraints. In the first, coefficients are constrained to equal a small number of values (homogeneity constraint). The second constraint attempts to set as many coefficients to zero as possible (sparsity constraint). The resultant interpretable directions are either calculated to be close to the original principal component directions, or calculated in a stepwise manner that may make the components more orthogonal. A small dataset on characteristics of cars is used to introduce the techniques. A more substantial data mining application is also given, illustrating the ability of the procedure to scale to a very large number of variables.
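A toy sketch in the spirit of the sparsity constraint (simple thresholding of PCA loadings followed by re-normalisation; the threshold value is arbitrary, and the paper's constrained components are computed differently).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(14)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

pca = PCA(n_components=2).fit(X)
loadings = pca.components_

# Set small coefficients to zero so each component involves only a few variables.
sparse = np.where(np.abs(loadings) < 0.3, 0.0, loadings)
sparse /= np.linalg.norm(sparse, axis=1, keepdims=True)

# How close the interpretable directions stay to the original ones (cosine per component).
print(np.abs(np.sum(sparse * loadings, axis=1)))
```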
