Related Articles
20 similar articles found.
1.
The analysis of high-dimensional data often begins with the identification of lower dimensional subspaces. Principal component analysis is a dimension reduction technique that identifies linear combinations of variables along which most variation occurs or which best “reconstruct” the original variables. For example, many temperature readings may be taken in a production process when in fact there are just a few underlying variables driving the process. A problem with principal components is that the linear combinations can seem quite arbitrary. To make them more interpretable, we introduce two classes of constraints. In the first, coefficients are constrained to equal a small number of values (homogeneity constraint). The second constraint attempts to set as many coefficients to zero as possible (sparsity constraint). The resultant interpretable directions are either calculated to be close to the original principal component directions, or calculated in a stepwise manner that may make the components more orthogonal. A small dataset on characteristics of cars is used to introduce the techniques. A more substantial data mining application is also given, illustrating the ability of the procedure to scale to a very large number of variables.
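
A minimal sketch of the sparsity idea, using scikit-learn's SparsePCA as a stand-in for the paper's own algorithm (the data and the alpha penalty are invented for illustration): sparse loadings set many coefficients to exactly zero, which is what makes the directions easier to read.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, :3] += 3 * rng.normal(size=(200, 1))   # one latent variable drives three readings

dense = PCA(n_components=2).fit(X)
sparse = SparsePCA(n_components=2, alpha=2.0, random_state=0).fit(X)

print(np.round(dense.components_, 2))   # every variable gets a nonzero coefficient
print(np.round(sparse.components_, 2))  # many coefficients forced to exactly zero
```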

2.
In practice, when a principal component analysis is applied on a large number of variables the resultant principal components may not be easy to interpret, as each principal component is a linear combination of all the original variables. Selection of a subset of variables that contains, in some sense, as much information as possible and enhances the interpretations of the first few covariance principal components is one possible approach to tackle this problem. This paper describes several variable selection criteria and investigates which criteria are best for this purpose. Although some criteria are shown to be better than others, the main message of this study is that it is unwise to rely on only one or two criteria. It is also clear that the interdependence between variables and the choice of how to measure closeness between the original components and those using subsets of variables are both important in determining the best criteria to use.
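
As a rough illustration of one such criterion (not necessarily any of those studied in the paper), the sketch below scores every three-variable subset by how strongly its component scores correlate with the full-data scores; the closeness measure is a made-up stand-in.

```python
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # interdependent variables
full = PCA(n_components=2).fit_transform(X)

def closeness(subset):
    sub = PCA(n_components=2).fit_transform(X[:, subset])
    r = np.corrcoef(full.T, sub.T)[:2, 2:]     # cross-correlations of scores
    return np.sum(r ** 2)                      # hypothetical closeness criterion

best = max(combinations(range(6), 3), key=closeness)
print("best 3-variable subset:", best)
```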

3.
A number of results have been derived recently concerning the influence of individual observations in a principal component analysis. Some of these results, particularly those based on the correlation matrix, are applied to data consisting of seven anatomical measurements on students. The data have a correlation structure which is fairly typical of many found in allometry. This case study shows that theoretical influence functions often provide good estimates of the actual changes observed when individual observations are deleted from a principal component analysis. Different observations may be influential for different aspects of the principal component analysis (coefficients, variances and scores of principal components); these differences, and the distinction between outlying and influential observations are discussed in the context of the case study. A number of other complications, such as switching and rotation of principal components when an observation is deleted, are also illustrated.
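
Empirical influence can be checked directly by deleting each observation in turn and re-running the eigendecomposition; the sketch below (with simulated data and an arbitrary flagging threshold) tracks the resulting change in the leading eigenvalue of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
S = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.5], [0.6, 0.5, 1.0]]
X = rng.multivariate_normal(np.zeros(3), S, size=60)
X[0] = [4.0, -4.0, 4.0]                         # plant one unusual observation

def leading_eigenvalue(data):
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[-1]

lam = leading_eigenvalue(X)
for i in range(len(X)):
    change = lam - leading_eigenvalue(np.delete(X, i, axis=0))
    if abs(change) > 0.05:                      # arbitrary flagging threshold
        print(f"observation {i}: leading eigenvalue changes by {change:+.3f}")
```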

4.
Automatic identification of faces from a database given a digital view is becoming increasingly important. The question arises whether or not there can be a face identification system similar to the fingerprinting system, where a certain number of matches are regarded as sufficient to identify the person in the database. We first give a very general review of the topic of facial measurements and indicate some deep statistical problems. We then analyze a database of photographs. Certain characteristics of the population are provided, such as the modes of variation and correlation structures using shape analysis. The data involve angles as well as distances. The principal component analysis for angular data is discussed, its conversion into landmark data is established and the two approaches are compared. A new anchor shape analysis approach for specialized distances is discussed.

5.
Principal components are useful for multivariate process control. Typically, the principal component variables are selected to summarize the variation in the process data. We provide an analysis to select the principal component variables to be included in a multivariate control chart that incorporates the unique aspects of the process control problem (rather than using traditional principal component guidelines).
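
A common way to put this into practice is a Hotelling T² chart on the retained component scores; the sketch below assumes k = 3 retained components and a chi-squared control limit, both placeholder choices rather than the paper's recommendation.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_ref = rng.normal(size=(200, 8))      # in-control reference data
X_new = rng.normal(size=(50, 8))
X_new[25:] += 1.5                      # simulated process shift

k = 3                                  # retained components (placeholder choice)
pca = PCA(n_components=k).fit(X_ref)
scores = pca.transform(X_new)

t2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)   # Hotelling T^2
limit = chi2.ppf(0.99, df=k)                                 # approximate limit
print("signals at observations:", np.where(t2 > limit)[0])
```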

6.
Principal component and correspondence analysis can both be used as exploratory methods for representing multivariate data in two dimensions. Circumstances are noted under which the (possibly inappropriate) application of principal components to untransformed compositional data approximates a correspondence analysis of the raw data. Aitchison (1986) has proposed a method for the principal component analysis of compositional data involving transformation of the raw data. It is shown how this can be approximated by a correspondence analysis of appropriately transformed data. The latter approach may be preferable when there are zeroes in the data.
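
Aitchison's transformation is the centred log-ratio: take logs of the parts and subtract each row's mean log. A minimal sketch on simulated compositions (note it assumes strictly positive parts, which is exactly the case where the abstract suggests correspondence analysis may be preferable):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
raw = rng.gamma(shape=2.0, size=(100, 5))
comp = raw / raw.sum(axis=1, keepdims=True)    # compositional: rows sum to one

logc = np.log(comp)                            # requires no zero parts
clr = logc - logc.mean(axis=1, keepdims=True)  # centred log-ratio transform
scores = PCA(n_components=2).fit_transform(clr)
print(scores[:3])
```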

7.
We investigate the effect of measurement error on principal component analysis in the high-dimensional setting. The effects of random, additive errors are characterized by the expectation and variance of the changes in the eigenvalues and eigenvectors. The results show that the impact of uncorrelated measurement error on the principal component scores is mainly in terms of increased variability and not bias. In practice, the error-induced increase in variability is small compared with the original variability for the components corresponding to the largest eigenvalues. This suggests that the impact will be negligible when these component scores are used in classification and regression or for visualizing data. However, the measurement error will contribute to a large variability in component loadings, relative to the loading values, such that interpretation based on the loadings can be difficult. The results are illustrated by simulating additive Gaussian measurement error in microarray expression data from cancer tumours and control tissues.
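
The score-stability claim is easy to reproduce in simulation. The sketch below (with invented dimensions and noise level, not the paper's microarray data) adds uncorrelated Gaussian error to a strong two-component structure and correlates clean against noisy scores:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n, p = 50, 500                                  # many more variables than subjects
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p)) * 3   # two strong components
E = rng.normal(scale=0.5, size=(n, p))          # additive measurement error

clean = PCA(n_components=2).fit_transform(X)
noisy = PCA(n_components=2).fit_transform(X + E)

for j in range(2):                              # scores should barely move
    r = abs(np.corrcoef(clean[:, j], noisy[:, j])[0, 1])
    print(f"component {j + 1}: |corr(clean, noisy)| = {r:.3f}")
```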

8.
One strategy of exploratory factor analysis is to decide on the number of factors to extract by means of the eigenvalues of an initial principal component analysis. The present article proves that there is a nonzero covariance of the factors with the components rejected when the number of factors to extract is determined by means of principal component analysis. Thus, some of the variance declared as irrelevant or unwanted in an initial principal component analysis is again part of the final factor model.
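
The result can be checked numerically: determine the number of factors from the eigenvalues (here a Kaiser-style rule as a stand-in), fit a factor model, and covary the factor scores with the rejected component scores. A sketch on simulated data; the cross-covariances are in general not zero:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(6)
L = rng.normal(size=(8, 2))
X = rng.normal(size=(300, 2)) @ L.T + rng.normal(scale=0.7, size=(300, 8))

k = int((np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)) > 1).sum())  # eigenvalue rule
factors = FactorAnalysis(n_components=k).fit_transform(X)
components = PCA().fit_transform(X)

# cross-covariance of the factors with the rejected components
print(np.round(np.cov(factors.T, components[:, k:].T)[:k, k:], 3))
```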

9.
Medical images and genetic assays typically generate data with more variables than subjects. Scientists may use a two-step approach for testing hypotheses about Gaussian mean vectors. In the first step, principal components analysis (PCA) selects a set of sample components fewer in number than the sample size. In the second step, applying classical multivariate analysis of variance (MANOVA) methods to the reduced set of variables provides the desired hypothesis tests. Simulation results presented here indicate that success of the PCA in the first step requires nearly all variation to occur in population components far fewer in number than the number of subjects. In the second step, multivariate tests fail to attain reasonable power except in restrictive, favorable cases. The results encourage using other approaches discussed in the article to provide dependable hypothesis testing with high dimension, low sample size data (HDLSS).
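
For reference, the two-step pipeline itself is short; the sketch below (pure-noise data, so the test should not reject) uses scikit-learn for step one and statsmodels' MANOVA for step two. The abstract's point is that this pipeline is unreliable in HDLSS settings, not that it is hard to run.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
n, p, k = 30, 100, 3                   # far more variables than subjects
X = rng.normal(size=(n, p))            # pure noise: no real group difference

df = pd.DataFrame(PCA(n_components=k).fit_transform(X),
                  columns=["pc1", "pc2", "pc3"])
df["group"] = np.repeat(["A", "B"], n // 2)

print(MANOVA.from_formula("pc1 + pc2 + pc3 ~ group", data=df).mv_test())
```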

10.
Dynamic principal component analysis (DPCA), also known as frequency domain principal component analysis, was developed by Brillinger [Time Series: Data Analysis and Theory, Vol. 36, SIAM, 1981] to decompose multivariate time-series data into a few principal component series. A primary advantage of DPCA is its capability of extracting essential components from the data by reflecting their serial dependence. It is also used to estimate the common component in a dynamic factor model, which is frequently used in econometrics. However, its beneficial property cannot be utilized when missing values are present, which should not be simply ignored when estimating the spectral density matrix in the DPCA procedure. Based on a novel combination of conventional DPCA and the self-consistency concept, we propose a DPCA method for data with missing values. We demonstrate the advantage of the proposed method over some existing imputation methods through Monte Carlo experiments and real data analysis.
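
The core of conventional DPCA is an eigendecomposition of a smoothed spectral density matrix at each frequency; a crude sketch on complete data (the self-consistency treatment of missing values, which is the paper's contribution, is not shown):

```python
import numpy as np

rng = np.random.default_rng(16)
n, p = 512, 3
common = np.sin(np.linspace(0, 40 * np.pi, n))        # shared oscillation, 20 cycles
X = common[:, None] * rng.normal(size=(1, p)) + rng.normal(scale=0.5, size=(n, p))

F = np.fft.rfft(X - X.mean(axis=0), axis=0)
j, m = 20, 4                                          # target frequency index, window
# smoothed spectral density matrix: periodograms averaged over nearby frequencies
S = sum(np.outer(F[k], np.conj(F[k]))
        for k in range(j - m, j + m + 1)) / (2 * m + 1) / n
vals, vecs = np.linalg.eigh(S)
print("leading dynamic eigenvector (modulus):", np.round(np.abs(vecs[:, -1]), 2))
```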

11.
Double arrays of n rows and p columns can be regarded as n drawings from some p-dimensional population. A sequence of such arrays is considered. Principal component analysis for each array forms sequences of sample principal components and eigenvalues. The continuity of these sequences, in the sense of convergence with probability one and convergence in probability, is investigated; this appears to be informative for pattern study and prediction of principal components. Various features of paths of sequences of population principal components are highlighted through an example.

12.
Using the spatial dependence of observations from multivariate images, it is possible to construct methods for data reduction that perform better than the widely used principal components procedure. Switzer and Green introduced the min/max autocorrelation factors (MAF) process for transforming the data to a new set of vectors where the components are arranged according to the amount of autocorrelation. MAF performs well when the underlying image consists of large homogeneous regions. For images with many transitions between smaller homogeneous regions, however, MAF may not perform well. A modification of the MAF process, the restricted min/max autocorrelation factors (RMAF) process, which takes into account the transitions between homogeneous regions, is introduced. Simulation experiments show that large improvements can be achieved using RMAF rather than MAF.
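
The MAF transform can be written as a generalized eigenproblem: minimize the variance of unit-lag spatial differences relative to the total variance. A minimal sketch along one spatial axis with simulated smooth bands (the RMAF modification for region transitions is not shown):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(8)
Z = np.cumsum(rng.normal(size=(500, 4)), axis=0)   # spatially smooth "bands"
Z = Z - Z.mean(axis=0)

S = np.cov(Z, rowvar=False)                # total covariance
D = Z[1:] - Z[:-1]                         # unit-lag spatial differences
Sd = np.cov(D, rowvar=False)               # difference covariance

vals, vecs = eigh(Sd, S)                   # small eigenvalue = high autocorrelation
maf = Z @ vecs                             # factors ordered by autocorrelation
print(np.round(vals, 3))
```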

13.
Data for studies of biological shape often consist of the locations of individually named points (landmarks) considered to be 'homologous' (to correspond biologically) from form to form. In 1917 D'Arcy Thompson introduced an elegant model of homology as deformation: the configuration of landmark locations for any one form is viewed as a finite sample from a smooth mapping representing its biological relationship to any other form of the data set. For data in two dimensions, multivariate statistical analysis of landmark locations may proceed unambiguously in terms of complex-valued shape coordinates (u, v) = (C - A)/(B - A) for sets of landmark triangles ABC. These are the coordinates of one vertex/landmark after scaling so that the remaining two vertices are at (0,0) and (1,0). Expressed in this fashion, the biological interpretation of the statistical analysis as a homology mapping would appear to depend on the triangulation. This paper introduces an analysis of landmark data and homology mappings using a hierarchy of geometric components of shape difference or shape change. Each component is a smooth deformation taking the form of a bivariate polynomial in the shape coordinates and is estimated in a manner nearly invariant with respect to the choice of a triangulation.
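
The shape coordinates amount to one line of complex arithmetic; a toy example with made-up landmark locations:

```python
# hypothetical landmark locations of one triangle, as complex numbers
A, B, C = 0.0 + 0.0j, 2.0 + 0.5j, 1.0 + 1.5j

w = (C - A) / (B - A)        # sends A to (0,0) and B to (1,0)
u, v = w.real, w.imag        # shape coordinates of the third landmark
print(f"shape coordinates: ({u:.3f}, {v:.3f})")
```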

14.
This paper focuses on applying the method of observed confidence levels to problems commonly encountered in principal component analyses. In particular, we focus on assigning levels of confidence to the number of components that explain a specified proportion of variation in the original data. Approaches based on the normal model as well as a nonparametric model are explored. The usefulness of the methods is discussed using an example and an empirical study.
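
One way to approximate such confidence levels nonparametrically is to bootstrap the data and record how often each candidate dimension explains the target proportion of variance; a sketch with a made-up 90% threshold (the paper's exact construction may differ):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(9)
X = rng.multivariate_normal(np.zeros(5), np.diag([5.0, 3.0, 1.0, 0.5, 0.5]), size=80)

def n_components(data, target=0.90):
    vals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]   # descending
    return int(np.searchsorted(np.cumsum(vals) / vals.sum(), target)) + 1

counts = Counter(n_components(X[rng.integers(0, len(X), size=len(X))])
                 for _ in range(2000))
for k in sorted(counts):                 # bootstrap frequency per dimension
    print(f"{k} components explain 90%: {counts[k] / 2000:.3f}")
```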

15.
Principal components are a well-established tool in dimension reduction. The extension to principal curves allows for general smooth curves which pass through the middle of a multidimensional data cloud. In this paper local principal curves are introduced, which are based on the localization of principal component analysis. The proposed algorithm is able to identify closed curves as well as multiple curves which may or may not be connected. For the evaluation of the performance of principal curves as a tool for data reduction, a measure of coverage is suggested. By use of simulated and real data sets the approach is compared to various alternative concepts of principal curves.
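
The core move in a local principal curve is simple: at the current point, compute a kernel-weighted local mean and the first local principal component, step along it, repeat. A rough sketch of that loop on a noisy circle (bandwidth and step size are arbitrary placeholders, not the paper's algorithm in full):

```python
import numpy as np

def lpc_step(X, x, h=0.5):
    """Kernel-weighted local mean and leading local PC direction at x."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    cov = (w[:, None] * (X - mu)).T @ (X - mu) / w.sum()
    return mu, np.linalg.eigh(cov)[1][:, -1]

rng = np.random.default_rng(10)
t = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.1, size=(300, 2))

x, last_d, curve = X[0], None, []
for _ in range(60):
    mu, d = lpc_step(X, x)
    if last_d is not None and d @ last_d < 0:
        d = -d                          # keep moving in a consistent direction
    x, last_d = mu + 0.1 * d, d
    curve.append(x)
```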

16.
Principal component analysis is a popular dimension reduction technique often used to visualize high-dimensional data structures. In genomics, this can involve millions of variables, but only tens to hundreds of observations. Theoretically, such extreme high dimensionality will cause biased or inconsistent eigenvector estimates, but in practice, the principal component scores are used for visualization with great success. In this paper, we explore when and why the classical principal component scores can be used to visualize structures in high-dimensional data, even when there are few observations compared with the number of variables. Our argument is twofold: First, we argue that eigenvectors related to pervasive signals will have eigenvalues scaling linearly with the number of variables. Second, we prove that for linearly increasing eigenvalues, the sample component scores will be scaled and rotated versions of the population scores, asymptotically. Thus, the visual information of the sample scores will be unchanged, even though the sample eigenvectors are biased. In the case of pervasive signals, the principal component scores can be used to visualize the population structures, even in extreme high-dimensional situations.
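
The first part of the argument is easy to see in simulation: with a pervasive signal (one that loads on essentially every variable), the leading sample eigenvalue grows roughly linearly in p. A sketch with fabricated data:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 40
for p in (200, 400, 800):
    u = rng.normal(size=(n, 1))
    X = u @ np.ones((1, p)) + rng.normal(size=(n, p))    # pervasive signal + noise
    lam1 = np.linalg.eigvalsh(np.cov(X, rowvar=False))[-1]
    print(f"p = {p}: leading eigenvalue ≈ {lam1:.0f}")   # roughly linear in p
```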

17.
The broken-stick (BS) is a popular stopping rule in ecology to determine the number of meaningful components of principal component analysis. However, its properties have not been systematically investigated. The purpose of the current study is to evaluate its ability to detect the correct dimensionality in a data set and whether it tends to over- or underestimate it. A Monte Carlo protocol was carried out. Two main correlation matrices deemed usual in practice were used with three levels of correlation (0, 0.10 and 0.30) between components (generating oblique structure) and with different sample sizes. Analyses of the population correlation matrices indicated that, for extremely large sample sizes, the BS method could be correct for only one of the six simulated structures. It actually failed to identify the correct dimensionality half the time with orthogonal structures and did even worse with some oblique ones. In harder conditions, results show that the power of the BS decreases as sample size increases, weakening its usefulness in practice. Since the BS method seems unlikely to identify the underlying dimensionality of the data, and given that better stopping rules exist, it appears a poor choice when carrying out principal component analysis.
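
For reference, the broken-stick rule compares each eigenvalue's share of total variance with the expected length of the k-th longest piece of a randomly broken stick; a minimal implementation:

```python
import numpy as np

def broken_stick(eigenvalues):
    """Number of components retained by the broken-stick rule."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    p = len(lam)
    expected = np.array([sum(1.0 / i for i in range(k, p + 1)) / p
                         for k in range(1, p + 1)])   # E[k-th longest piece]
    below = np.where(lam / lam.sum() <= expected)[0]
    return int(below[0]) if below.size else p         # stop at first non-exceedance

print(broken_stick([3.0, 1.2, 0.5, 0.2, 0.1]))        # retains 1 component
```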

18.
Univariate time series often take the form of a collection of curves observed sequentially over time. Examples of these include hourly ground-level ozone concentration curves. These curves can be viewed as a time series of functions observed at equally spaced intervals over a dense grid. Since functional time series may contain various types of outliers, we introduce a robust functional time series forecasting method to downweight the influence of outliers in forecasting. Through a robust principal component analysis based on projection pursuit, a time series of functions can be decomposed into a set of robust dynamic functional principal components and their associated scores. Conditioning on the estimated functional principal components, the crux of the curve-forecasting problem lies in modelling and forecasting the principal component scores, through a robust vector autoregressive forecasting method. Via a simulation study and an empirical study on forecasting ground-level ozone concentration, the robust method demonstrates the superior forecast accuracy that dynamic functional principal component regression entails. The robust method also shows superior estimation accuracy for the parameters of the vector autoregressive models used in modelling and forecasting the principal component scores, and thus improves curve forecast accuracy.
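
The projection-pursuit idea behind the robust decomposition can be sketched crudely: search for the direction along which a robust spread measure (here the MAD, as a stand-in for the paper's choice) of the projections is largest, so outliers do not inflate it. Fabricated data, random-search optimizer:

```python
import numpy as np

def pp_robust_pc(X, n_dir=5000, seed=14):
    """Direction maximizing a robust spread (MAD) of the projections."""
    rng = np.random.default_rng(seed)
    Xc = X - np.median(X, axis=0)
    best, best_spread = None, -np.inf
    for _ in range(n_dir):                    # crude random search over directions
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)
        z = Xc @ a
        spread = np.median(np.abs(z - np.median(z)))   # MAD, robust to outliers
        if spread > best_spread:
            best, best_spread = a, spread
    return best

rng = np.random.default_rng(15)
X = rng.normal(size=(200, 6))
X[:5] += 10                                   # a few outlying observations
print(np.round(pp_robust_pc(X), 2))
```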

19.
In functional linear regression, one conventional approach is to first perform functional principal component analysis (FPCA) on the functional predictor and then use the first few leading functional principal component (FPC) scores to predict the response variable. The leading FPCs estimated by conventional FPCA capture the major sources of variation in the functional predictor, but they may not be the FPCs most correlated with the response variable, so the prediction accuracy of the functional linear regression model may not be optimal. In this paper, we propose a supervised version of FPCA that takes into account the correlation between the functional predictor and the response variable. It can automatically estimate leading FPCs, which represent the major source of variation of the functional predictor and are simultaneously correlated with the response variable. Our supervised FPCA method is demonstrated to have better prediction accuracy than the conventional FPCA method using one real application on electroencephalography (EEG) data and three carefully designed simulation studies.
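
A rough multivariate analogue of the supervision idea (in the spirit of supervised principal components, not the paper's functional construction) screens predictors by their correlation with the response before extracting components:

```python
import numpy as np

rng = np.random.default_rng(12)
n, p = 150, 40
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n)

r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = r > 0.2                            # hypothetical screening threshold
Xs = X[:, keep] - X[:, keep].mean(axis=0)

_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
score = Xs @ Vt[0]                        # first "supervised" component score
print(f"corr with response: {np.corrcoef(score, y)[0, 1]:.3f}")
```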

20.
Canonical variate analysis can be viewed as a two-stage principal component analysis. Explicit consideration of the principal components from the first stage, formalized in the context of shrunken estimators, leads to a number of practical advantages. In morphometric studies, the first eigenvector is often a size vector, with the remaining vectors contrast or shape-type vectors, so that a decomposition of the canonical variates into size and shape components can be achieved. In applied studies, often a small number of the principal components effect most of the separation between groups; plots of group means and associated concentration ellipses (ideally these should be circular) for important principal components facilitate graphical inspection. Of considerable practical importance is the potential for improved stability of the estimated canonical vectors. When the between-groups sum of squares for a particular principal component is small, and the corresponding eigenvalue of the within-groups correlation matrix is also small, marked instability of the canonical vectors can be expected. The introduction of shrunken estimators, by adding shrinkage constants to the eigenvalues, leads to more stable coefficients.
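
A minimal sketch of the two-stage view on simulated groups: whiten with respect to the within-groups covariance (stage one), then take principal components of the whitened group means (stage two). In this framing, the shrinkage idea amounts to adding a constant to the within-groups eigenvalues before whitening.

```python
import numpy as np

rng = np.random.default_rng(13)
means = ([0, 0, 0], [1, 0.5, 0], [0, 1, 0.5])
groups = [rng.normal(loc=m, size=(40, 3)) for m in means]

W = sum(np.cov(G, rowvar=False) for G in groups) / len(groups)  # within-groups cov
vals, vecs = np.linalg.eigh(W)
c = 0.0                                   # shrinkage constant; c > 0 stabilizes
T = vecs / np.sqrt(vals + c)              # stage 1: whitening transform

M = np.array([G.mean(axis=0) for G in groups]) @ T
_, _, Vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
canonical_vectors = T @ Vt.T              # stage 2: PCA of whitened means
print(np.round(canonical_vectors, 3))
```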
