Similar Articles
1.
Principal component analysis (PCA) is a widely used statistical technique for determining subscales in questionnaire data. As in any other statistical technique, missing data may complicate both its execution and its interpretation. In this study, six methods for dealing with missing data in the context of PCA are reviewed and compared: listwise deletion (LD), pairwise deletion, the missing data passive approach, regularized PCA, the expectation-maximization algorithm, and multiple imputation. Simulations show that, except for LD, all methods give about equally good results for realistic percentages of missing data. The choice of procedure can therefore be based on ease of application or simply on the availability of a technique.
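A rough illustration (not the study's own code): the sketch below contrasts listwise deletion with an EM-style iterative imputation before PCA, using scikit-learn on a hypothetical data set with 10% of values missing completely at random.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] += 0.8 * X[:, 0]                      # induce some correlation structure
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan   # 10% missing completely at random

# Listwise deletion (LD): drop every row containing any missing value.
X_ld = X_miss[~np.isnan(X_miss).any(axis=1)]
pca_ld = PCA(n_components=2).fit(X_ld)

# EM-style alternative: iteratively impute, then run PCA on the completed data.
X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)
pca_imp = PCA(n_components=2).fit(X_imp)

print("rows kept by LD:", X_ld.shape[0], "of", X.shape[0])
print("explained variance (LD):     ", pca_ld.explained_variance_ratio_.round(3))
print("explained variance (imputed):", pca_imp.explained_variance_ratio_.round(3))
```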

2.
Principal component regression (PCR) has two steps: estimating the principal components and performing the regression using these components. These steps are generally performed sequentially. In PCR, a crucial issue is the selection of the principal components to be included in the regression. In this paper, we build a hierarchical probabilistic PCR model with a dynamic component selection procedure. A latent variable is introduced to select promising subsets of components, based upon the significance of the relationship between the response variable and the principal components in the regression step. We illustrate this model using real and simulated examples. The simulations demonstrate that our approach outperforms some existing methods in terms of the root mean squared error of the regression coefficient.
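A minimal sketch of the standard two-step PCR baseline that the paper builds on (the hierarchical probabilistic component selection itself is not reproduced); the latent-factor data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Correlated predictors driven by a few latent factors, so that the signal
# is concentrated in the leading principal components.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + rng.normal(scale=0.2, size=(300, 10))
y = 2.0 * latent[:, 0] + rng.normal(scale=0.5, size=300)

# Step 1: estimate the principal components and their scores.
pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)

# Step 2: regress the response on the retained component scores.
reg = LinearRegression().fit(Z, y)
print("R^2 of the PCR fit:", round(reg.score(Z, y), 3))
```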

3.
Principal component analysis (PCA) and functional principal component analysis are key tools in multivariate analysis, in particular for modelling yield curves, but little attention is given to questions of uncertainty, either in the components themselves or in derived quantities such as scores. Actuaries who use PCA to model yield curves when assessing interest rate risk for insurance companies are required to show the uncertainty in their calculations. Asymptotic results based on assumptions of multivariate normality are unsatisfactory for modest samples, and the application of bootstrap methods is not straightforward, with the novel pitfalls of possible inversions in the order of sample components and reversals of signs. We present methods for overcoming these difficulties and discuss other potential hazards that arise.
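One simple way to handle the order-inversion and sign-reversal pitfalls is to align each bootstrap replicate's components to the full-sample components by absolute cosine similarity and sign matching. The sketch below is illustrative only, not necessarily the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(4), np.diag([4.0, 2.0, 1.0, 0.5]), size=150)

k = 2
ref = PCA(n_components=k).fit(X).components_       # full-sample loadings

boot = []
for _ in range(200):
    Xb = X[rng.integers(0, len(X), len(X))]        # bootstrap resample
    comp = PCA(n_components=k).fit(Xb).components_
    # Re-order: greedily match each reference component to the most similar
    # bootstrap one (a full assignment would use scipy's linear_sum_assignment).
    order = np.abs(ref @ comp.T).argmax(axis=1)
    comp = comp[order]
    # Re-sign: flip any component negatively correlated with its reference.
    signs = np.sign(np.sum(ref * comp, axis=1))
    boot.append(comp * signs[:, None])

print("bootstrap SEs of the loadings:\n", np.std(boot, axis=0).round(3))
```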

4.
Principal component analysis (PCA) is a popular technique for dimensionality reduction, but it is affected by the presence of outliers. The outlier sensitivity of classical PCA (CPCA) has motivated the development of new approaches. The effects of replacing outliers with estimates obtained by expectation-maximization (EM) and by multiple imputation (MI) were examined on an artificial and a real data set. Furthermore, robust PCA based on the minimum covariance determinant (MCD), PCA based on EM estimates in place of outliers, and PCA based on MI estimates in place of outliers were compared with the results of CPCA. We show how the effects of replacing outliers with EM and MI estimates depend on the ratio of outliers in the data set. Finally, when the ratio of outliers exceeds 20%, we suggest the use of EM and MI estimates in place of outliers as an alternative approach.
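A hedged sketch of the general idea: flag outliers by robust (MCD-based) Mahalanobis distances, treat them as missing, impute, and then run PCA. The threshold and data are hypothetical, and scikit-learn's imputer merely stands in for the EM/MI estimates described above.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X[:10] += 8.0                                   # plant a block of outlying rows

# Flag outliers by robust squared Mahalanobis distance (MCD scatter).
d2 = MinCovDet(random_state=0).fit(X).mahalanobis(X)
outlier = d2 > chi2.ppf(0.975, df=X.shape[1])

# Replace flagged rows with imputed estimates instead of deleting them.
X_clean = X.copy()
X_clean[outlier] = np.nan
X_clean = IterativeImputer(random_state=0).fit_transform(X_clean)

print("CPCA PC1 share:   ", PCA().fit(X).explained_variance_ratio_[0].round(3))
print("imputed PC1 share:", PCA().fit(X_clean).explained_variance_ratio_[0].round(3))
```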

5.
In order to explore and compare a finite number T of data sets by applying functional principal component analysis (FPCA) to the T associated probability density functions, we estimate these density functions using the multivariate kernel method. With the data set sizes fixed, we study the behaviour of this FPCA under the assumption that all the bandwidth matrices used in the density estimation are proportional to a common parameter h and proportional either to the variance matrices or to the identity matrix. In this context, we propose a selection criterion for the parameter h which depends only on the data and the FPCA method. Then, on simulated examples, we compare the quality of approximation of the FPCA when the bandwidth matrices are selected using either this criterion or two classical bandwidth selection methods, namely a plug-in and a cross-validation method.
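A loose one-dimensional analogue in Python: estimate the T densities with a common bandwidth factor h, discretize them on a grid, and approximate the FPCA by ordinary PCA on the resulting curves. The paper's multivariate bandwidth matrices and its h-selection criterion are not reproduced here.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(13)
T = 6
samples = [rng.normal(loc=m, scale=1.0, size=200) for m in np.linspace(-1, 1, T)]

grid = np.linspace(-5, 5, 200)
h = 0.4                       # common bandwidth factor shared by all T estimates
densities = np.vstack([gaussian_kde(s, bw_method=h)(grid) for s in samples])

# FPCA approximated by ordinary PCA on the discretized density curves.
pca = PCA(n_components=2).fit(densities)
print(pca.explained_variance_ratio_.round(3))
```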

6.
We consider the problem of clustering gamma-ray bursts (from the BATSE catalogue) through kernel principal component analysis, in which our proposed kernel outperforms other competing kernels in terms of clustering accuracy, and we obtain three physically interpretable groups of gamma-ray bursts. The effectiveness of the suggested kernel, in combination with kernel principal component analysis, in revealing natural clusters in noisy and nonlinear data while reducing the dimension of the data is also explored on two simulated data sets.
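The abstract does not specify the proposed kernel, so the sketch below uses a standard RBF kernel purely as a stand-in to show the kernel-PCA-then-cluster pipeline on hypothetical burst features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Hypothetical stand-ins for BATSE burst features (duration, fluence, hardness).
X = np.vstack([rng.normal(m, 0.6, size=(60, 3)) for m in (-2.0, 0.0, 2.0)])

Xs = StandardScaler().fit_transform(X)
# RBF kernel used here as a placeholder for the paper's proposed kernel.
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(Xs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print("cluster sizes:", np.bincount(labels))
```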

7.
A data table arranged according to two factors can often be considered a compositional table. An example is the number of unemployed people, split according to gender and age classes. Analyzed as compositions, the relevant information consists of ratios between different cells of such a table. This is particularly useful when analyzing several compositional tables jointly, where the absolute numbers are in very different ranges, e.g. if unemployment data are considered from different countries. Within the framework of the logratio methodology, compositional tables can be decomposed into independent and interactive parts, and orthonormal coordinates can be assigned to these parts. However, these coordinates usually require some prior knowledge about the data, and they are not easy to handle when exploring the relationships between the given factors. Here we propose a special choice of coordinates with a direct relation to centered logratio (clr) coefficients, which are particularly useful for an interpretation in terms of the original cells of the tables. With these coordinates, robust principal component analysis (rPCA) is performed for dimension reduction, allowing the relationships between the factors to be investigated. The link between orthonormal coordinates and clr coefficients enables rPCA to be applied, which would otherwise suffer from the singularity of the clr coefficients.
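A minimal sketch of the clr step followed by ordinary (non-robust) PCA on hypothetical table cells; the paper's actual contribution, orthonormal coordinates that make a robust PCA feasible despite the singularity of clr data, is only noted in the comments.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Hypothetical compositional tables flattened to rows (cells of a 2x3 table).
counts = rng.integers(1, 200, size=(50, 6)).astype(float)

# Centered logratio (clr): log of each part minus the row mean of the logs.
logc = np.log(counts)
clr = logc - logc.mean(axis=1, keepdims=True)

# clr rows sum to zero, so the clr covariance matrix is singular. Plain PCA
# still runs, but robust estimators generally do not -- hence the paper's
# orthonormal coordinates that are linked one-to-one with the clr coefficients.
pca = PCA(n_components=2).fit(clr)
print(pca.explained_variance_ratio_.round(3))
```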

8.
The problem of detecting influential observations in principal component analysis has been discussed by several authors. Radhakrishnan and Kshirsagar (1981), Critchley (1985), and Jolliffe (1986), among others, discussed this topic using the influence functions I(X; θ_s) and I(X; v_s) of eigenvalues and eigenvectors, which were derived under the assumption that the eigenvalues of interest were simple. In this paper we propose the influence functions I(X; Σ_{s=1}^q θ_s v_s v_s^T) and I(X; Σ_{s=1}^q v_s v_s^T) (q < p, where p is the number of variables) to investigate the influence on the subspace spanned by the principal components. These influence functions are applicable not only to the case where the eigenvalues of interest are all simple, but also to the case where there are multiple eigenvalues among those of interest.
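The object whose influence is studied, Σ_{s=1}^q θ_s v_s v_s^T (and its unweighted analogue Σ_{s=1}^q v_s v_s^T), is built on the projector onto the leading q-dimensional eigenspace. The small sketch below computes that projector and shows why it remains well defined even when individual eigenvectors are not (e.g. under multiple eigenvalues).

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
S = np.cov(X, rowvar=False)

# Eigendecomposition; eigh returns eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]

q = 2
V = evecs[:, :q]
P = V @ V.T     # projector onto the leading q-dimensional principal subspace
# P is invariant to sign flips and, within eigenspaces of multiple eigenvalues,
# to rotations of the individual eigenvectors -- which is why influence is
# defined on the subspace rather than on each eigenvector separately.
print("idempotent:", np.allclose(P @ P, P))
```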

9.
In many clinical studies, longitudinal biomarkers are often used to monitor the progression of a disease. For example, in a kidney transplant study, the glomerular filtration rate (GFR) is used as a longitudinal biomarker to monitor the progression of kidney function, while the patient's state of survival is characterized by multiple time-to-event outcomes, such as kidney transplant failure and death. It is known that the joint modelling of longitudinal and survival data leads to a more accurate and comprehensive estimation of the covariates' effects. While most joint models use the longitudinal outcome as a covariate for predicting survival, very few consider a further decomposition of the variation within the longitudinal trajectories and its effect on survival. We develop a joint model that uses functional principal component analysis (FPCA) to extract useful features from the longitudinal trajectories and adopts a competing risks model to handle multiple time-to-event outcomes. The longitudinal trajectories and the multiple time-to-event outcomes are linked via the shared functional features. The application of our model to a real kidney transplant data set reveals the significance of these functional features, and a simulation study is carried out to validate the accuracy of the estimation method.

10.
In human mortality modelling, if a population consists of several subpopulations it can be desirable to model their mortality rates simultaneously while taking into account the heterogeneity among them. Mortality forecasting methods tend to produce divergent forecasts for subpopulations when independence is assumed. However, given closely related social, economic and biological backgrounds, the mortality patterns of these subpopulations are expected to be non-divergent in the future. In this article, we propose a new method for the coherent modelling and forecasting of mortality rates for multiple subpopulations, in the sense of non-divergent life expectancy among subpopulations. The mortality rates of the subpopulations are treated as multilevel functional data, and a weighted multilevel functional principal component analysis (wMFPCA) approach is proposed to model and forecast them. The proposed model is applied to sex-specific data for nine developed countries, and the results show that, in terms of overall forecasting accuracy, it outperforms the independent model and the Product-Ratio model, as well as the unweighted multilevel functional principal component approach.

11.
This work is concerned with robustness in principal component analysis (PCA). The approach we adopt here is to replace the least squares criterion by another criterion based on a convex and sufficiently differentiable loss function ρ. Using this criterion, we propose a robust estimate of the location vector and introduce an orthogonality with respect to ρ in order to define the different steps of a PCA. The influence functions of the mean vector and the principal vectors are developed in order to provide a method for obtaining a robust PCA. The practical procedure is based on an alternating-steps algorithm.

12.
We propose a multiple imputation method based on principal component analysis (PCA) to deal with incomplete continuous data. To reflect the uncertainty in the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study and real data sets, the method is compared to two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Unlike these, the proposed method can easily be used on data sets where the number of individuals is smaller than the number of variables and where the variables are highly correlated. In addition, it provides unbiased point estimates of quantities of interest, such as an expectation, a regression coefficient or a correlation coefficient, with a smaller mean squared error. Furthermore, the confidence intervals built for the quantities of interest are often narrower whilst still ensuring valid coverage.
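A rough sketch of the multiple-imputation mechanics with posterior parameter draws, using scikit-learn's IterativeImputer with sample_posterior=True as a stand-in for the paper's Bayesian PCA imputation model; only the pooled point estimate and the between-imputation variance of Rubin's rules are shown.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=150)
X[rng.random(X.shape) < 0.15] = np.nan

M = 20                        # number of imputed data sets
estimates = []
for m in range(M):
    # sample_posterior=True draws imputations from a posterior, so that the
    # parameter uncertainty varies from one imputation to the next.
    Xm = IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    estimates.append(Xm[:, 0].mean())           # quantity of interest

estimates = np.array(estimates)
print("pooled estimate:            ", estimates.mean().round(3))
print("between-imputation variance:", estimates.var(ddof=1).round(4))
```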

13.
This paper reviews various treatments of non-metric variables in partial least squares (PLS) and principal component analysis (PCA) algorithms. The performance of the different treatments is compared in an extensive simulation study under several typical data generating processes, and recommendations are made accordingly. Moreover, we find that PLS-based methods are preferable in practice, since, regardless of the data generating process, PLS performs either as well as PCA or significantly better. As an application of PLS and PCA algorithms with non-metric variables, we consider the construction of a wealth index to predict household expenditures. Consistent with our simulation study, we find that a PLS-based wealth index with dummy coding outperforms PCA-based ones.
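A minimal sketch of a PLS-based wealth index with dummy coding on hypothetical asset indicators; all variable names and coefficients are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(8)
n = 400
# Hypothetical non-metric asset-ownership variables.
assets = pd.DataFrame({
    "roof":  rng.choice(["thatch", "tin", "tile"], n),
    "water": rng.choice(["well", "piped"], n),
    "car":   rng.choice(["no", "yes"], n, p=[0.8, 0.2]),
})
X = pd.get_dummies(assets, drop_first=True).to_numpy(dtype=float)
# Hypothetical log household expenditure, loosely driven by the assets.
y = X @ np.array([0.3, 0.6, 0.2, 0.9]) + rng.normal(scale=0.5, size=n)

# A one-component PLS score serves as the wealth index; unlike a PCA-based
# index, it is constructed to covary with the outcome it should predict.
pls = PLSRegression(n_components=1).fit(X, y)
wealth_index = pls.transform(X).ravel()
print("corr(index, expenditure):", np.corrcoef(wealth_index, y)[0, 1].round(3))
```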

14.
In this paper, a new method for robust principal component analysis (PCA) is proposed. PCA is a widely used tool for dimension reduction without substantial loss of information. However, classical PCA is vulnerable to outliers because of its dependence on the empirical covariance matrix. To overcome this weakness, several alternative approaches based on robust scatter matrices have been suggested. A popular choice is ROBPCA, which combines projection pursuit ideas with robust covariance estimation via a variance maximization criterion. Our approach is based on the fact that PCA can be formulated as a regression-type optimization problem, which is the main difference from previous approaches. The proposed robust PCA is derived by substituting the squared loss function with a robust penalty, the Huber loss function. A practical algorithm is proposed to carry out the optimization, and the convergence properties of the algorithm are investigated. Results from a simulation study and a real data example demonstrate the promising empirical properties of the proposed method.
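An illustrative IRLS-style sketch of the regression formulation for the first component only: alternate weighted least-squares updates of the score and loading vectors, with weights derived from the Huber loss. This follows the spirit of the proposal, not necessarily the authors' exact algorithm.

```python
import numpy as np

def huber_weight(r, c=1.345):
    """IRLS weight for the Huber loss: 1 inside [-c, c], c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

rng = np.random.default_rng(9)
X = np.outer(rng.normal(size=80), rng.normal(size=5))   # rank-1 signal
X += rng.normal(scale=0.1, size=X.shape)
X[0] += 20.0                                            # one gross outlier row
X = X - X.mean(axis=0)

# Alternating weighted least squares for the rank-1 fit X ~ u v^T.
u, v = X[:, 0].copy(), np.ones(X.shape[1])
for _ in range(50):
    W = huber_weight(X - np.outer(u, v))
    v = (W * X * u[:, None]).sum(0) / ((W * u[:, None] ** 2).sum(0) + 1e-12)
    u = (W * X * v).sum(1) / ((W * v ** 2).sum(1) + 1e-12)

print("robust first loading:", (v / np.linalg.norm(v)).round(3))
```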

15.
Datasets are sometimes divided into distinct subsets, e.g. due to multi-center sampling or to variations in instruments, questionnaire item ordering or mode of administration, and the data analyst then needs to assess whether a joint analysis is meaningful. The Principal Component Analysis-based Data Structure Comparisons (PCADSC) tools are three new non-parametric, visual diagnostic tools for investigating differences in structure between two subsets of a dataset through covariance matrix comparisons based on principal component analysis. The PCADSC tools are demonstrated on European Social Survey data on psychological well-being in three countries: Denmark, Sweden and Bulgaria. The data structures are found to differ between Denmark and Bulgaria, so a comparison of, for example, mean psychological well-being scores is not meaningful. However, when comparing Denmark and Sweden, very similar data structures, and thus comparable concepts of well-being, are found. Therefore, inter-country comparisons are warranted for these countries.
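The PCADSC tools themselves are visual; the sketch below conveys the underlying idea numerically, comparing the eigenvalue profiles and the angle between the leading loadings of two hypothetical subsets with deliberately different covariance structures.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
A = rng.multivariate_normal([0] * 3, [[1, .6, .6], [.6, 1, .6], [.6, .6, 1]], 200)
B = rng.multivariate_normal([0] * 3, [[1, .0, .0], [.0, 1, .6], [.0, .6, 1]], 200)

def structure(X):
    p = PCA().fit(StandardScaler().fit_transform(X))
    return p.explained_variance_ratio_, p.components_[0]

evr_a, pc1_a = structure(A)
evr_b, pc1_b = structure(B)
print("eigenvalue profiles:", evr_a.round(2), "vs", evr_b.round(2))
print("angle between first loadings (deg):",
      np.degrees(np.arccos(min(1.0, abs(pc1_a @ pc1_b)))).round(1))
```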

16.
Block-structured correlation matrices are correlation matrices in which the p variables are subdivided into homogeneous groups, with equal correlations for variables within each group, and equal correlations between any given pair of variables from different groups. Block-structured correlation matrices arise as approximations for certain data sets' true correlation matrices. A block structure in a correlation matrix entails a number of properties concerning its eigendecomposition and, therefore, a principal component analysis of the underlying data. This paper explores these properties, both from an algebraic and a geometric perspective, and discusses their robustness. Suggestions are also made regarding the choice of variables to be subjected to a principal component analysis in the presence of (approximately) block-structured variables.
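A small numeric illustration: within-group correlations ρ_k produce repeated eigenvalues 1 − ρ_k with multiplicity one less than the group size, while the remaining eigenvalues come from a small reduced matrix mixing the groups. The sizes and correlations below are hypothetical.

```python
import numpy as np

# Two groups of sizes 3 and 2, within-group correlations 0.7 and 0.5,
# and a common between-group correlation of 0.2.
sizes, within, between = [3, 2], [0.7, 0.5], 0.2
p = sum(sizes)
R = np.full((p, p), between)
start = 0
for k, s in enumerate(sizes):
    R[start:start + s, start:start + s] = within[k]
    start += s
np.fill_diagonal(R, 1.0)

# Expect the repeated eigenvalues 1 - 0.7 = 0.3 (twice) and 1 - 0.5 = 0.5 (once).
print(np.linalg.eigvalsh(R)[::-1].round(3))
```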

17.
In principal component analysis (PCA), it is crucial to know how many principal components (PCs) should be retained in order to account for most of the data variability. A class of "objective" rules for determining this quantity is the class of cross-validation (CV) methods. In this work we compare three CV techniques, showing how their performance depends on the structure of the covariance matrix. Finally, we propose a rule for choosing the "best" CV method and give an application to real data.
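One concrete CV scheme, sketched under the caveat that it is not necessarily any of the three techniques compared in the paper: hold out random cells, reconstruct them with a rank-k SVD fit (EM-style), and retain the k minimizing the prediction error (PRESS).

```python
import numpy as np

rng = np.random.default_rng(11)
# Data with an effective rank of 2 plus noise.
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8))
X += rng.normal(scale=0.3, size=X.shape)
X = X - X.mean(axis=0)

mask = rng.random(X.shape) < 0.10          # hold out 10% of the cells

def press(k, n_iter=50):
    """Impute held-out cells by iterating a rank-k SVD reconstruction."""
    Z = X.copy()
    Z[mask] = 0.0                          # X is centered, so 0 = column mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Z[mask] = ((U[:, :k] * s[:k]) @ Vt[:k])[mask]
    return ((X[mask] - Z[mask]) ** 2).sum()

for k in range(1, 5):
    print(k, round(press(k), 2))           # PRESS should bottom out near k = 2
```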

18.
Treating principal component analysis (PCA) and canonical variate analysis (CVA) as methods for approximating tables, we develop measures, collectively termed predictivity, that assess the quality of fit independently for each variable and for all dimensionalities. We illustrate their use with data from aircraft development, the African timber industry and copper froth measurements from the mining industry. Similar measures are described for assessing the predictivity associated with the individual samples (in the case of PCA and CVA) or group means (in the case of CVA). For these measures to be meaningful, certain essential orthogonality conditions must hold that are shown to be satisfied by predictivity.

19.
Statistics, as an applied science, has a great impact on a vast range of other sciences. The prediction of protein structures, with strong emphasis on their geometrical features as captured by dihedral angles, has invoked a branch of statistics known as directional statistics. One of the available biological techniques for such prediction is molecular dynamics simulation, which produces high-dimensional molecular structure data. Hence, it is expected that principal component analysis (PCA) can address some of the related statistical problems, particularly reducing the dimension of the variables involved. Since the dihedral angles are variables on a non-Euclidean space (their locus is the torus), direct application of PCA is not expected to provide much information in this case. Principal geodesic analysis is one of the recent methods for reducing dimension in the non-Euclidean case. A procedure for using this technique to reduce the dimension of a set of dihedral angles is highlighted in this paper. We further propose an extension of this tool, implemented in such a way that the torus is approximated by the product of two unit circles, and evaluate its application on a real data set. A comparison of this technique with some previous methods is also undertaken.
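A minimal sketch of the product-of-circles approximation: each angle is embedded as (cos, sin), so a pair of dihedral angles lives on a torus in R^4, and ordinary PCA is then applied to the embedded points. The angle distributions are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(12)
# Hypothetical dihedral-angle pairs (phi, psi) in radians, clustered on the torus.
phi = rng.vonmises(mu=-1.0, kappa=8.0, size=300)
psi = rng.vonmises(mu=2.0, kappa=8.0, size=300)

# Approximate the torus by the product of two unit circles: embed each angle
# as (cos, sin), giving points in R^4, then apply ordinary PCA.
E = np.column_stack([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)])
pca = PCA(n_components=2).fit(E)
print(pca.explained_variance_ratio_.round(3))
```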

20.
Missing data are a common problem in almost all areas of empirical research. Ignoring the missing data mechanism, especially when data are missing not at random (MNAR), can result in biased and/or inefficient inference. Because an MNAR mechanism is not verifiable from the observed data, sensitivity analysis is often used to assess it. Current sensitivity analysis methods primarily assume a model for the response mechanism in conjunction with a measurement model and examine sensitivity to the missing data mechanism via the parameters of the response model. Recently, Jamshidian and Mata (Post-modelling sensitivity analysis to detect the effect of missing data mechanism, Multivariate Behav. Res. 43 (2008), pp. 432–452) introduced a new method of sensitivity analysis that does not require the difficult task of modelling the missing data mechanism. In this method, a single measurement model is fitted both to all of the data and to a sub-sample of the data, and the discrepancy between the parameter estimates obtained from the two data sets is used as a measure of sensitivity to the missing data mechanism. Jamshidian and Mata describe their method mainly in the context of detecting data that are missing completely at random (MCAR). They used a bootstrap-type method, relying on heuristic input from the researcher, to test for the discrepancy between the parameter estimates. Instead of using the bootstrap, the current article obtains a confidence interval for the parameter differences between the two samples based on an asymptotic approximation. Because it does not use the bootstrap, the developed procedure avoids the convergence problems likely with bootstrap methods; it does not require heuristic input from the researcher and can readily be implemented in statistical software. The article also discusses methods of obtaining sub-samples that may be used to test missing at random (MAR) in addition to MCAR. An application of the developed procedure to a real data set, from the first wave of an ongoing longitudinal study on aging, is presented. Simulation studies using two methods of missing data generation are performed as well, and show promise for the proposed sensitivity method. One of the missing data generation methods is also new and interesting in its own right.

