Similar literature
 Found 20 similar articles (search time: 46 ms)
1.
Medical images and genetic assays typically generate data with more variables than subjects. Scientists may use a two-step approach for testing hypotheses about Gaussian mean vectors. In the first step, principal components analysis (PCA) selects a set of sample components fewer in number than the sample size. In the second step, applying classical multivariate analysis of variance (MANOVA) methods to the reduced set of variables provides the desired hypothesis tests. Simulation results presented here indicate that success of the PCA in the first step requires nearly all variation to occur in population components far fewer in number than the number of subjects. In the second step, multivariate tests fail to attain reasonable power except in restrictive, favorable cases. The results encourage using other approaches discussed in the article to provide dependable hypothesis testing with high dimension, low sample size (HDLSS) data.

2.
Summary: The detection of errors and outliers is an important step in data processing, especially for errors arising from data entry operations, because they are entirely the responsibility of the data processing staff. The duplicate performance method is commonly used to detect this type of error. It typically involves typing the same data twice, with no particular order of priority. If the errors are uniformly distributed among individuals, retyping a fraction of the total will typically remove the same fraction of the errors. A new method is presented that improves on this procedure by sorting the records so that the most unlikely ones come first. The performance of the methodology has been tested by a Monte Carlo simulation, using an existing database of categorical answers on housing characteristics in Uruguay. The database was first randomly contaminated, and the proposed procedure was then applied. The results show that if partial retyping follows the proposed order, about 50% of the errors can be removed while keeping the retyping effort between 4% and 14% of the dataset, whereas attaining a similar result with the standard methodology requires processing 50% of the database on average. The new ordering is based upon the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.

3.
Clustering of Variables Around Latent Components (cited 1 time in total: 0 self-citations, 1 by others)
Abstract

Clustering of variables around latent components is investigated as a means to organize multivariate data into meaningful structures. The coverage includes (i) the case where it is desirable to lump together correlated variables no matter whether the correlation coefficient is positive or negative; (ii) the case where negative correlation indicates strong disagreement among variables; (iii) an extension of the clustering techniques which makes it possible to explain the clustering of variables taking account of external data. The strategy basically consists of performing a hierarchical cluster analysis, followed by a partitioning algorithm. Both algorithms aim at maximizing the same criterion, which reflects the extent to which variables in each cluster are related to the latent variable associated with this cluster. Illustrations are outlined using real data sets from sensory studies.

4.
5.
Principal components are useful for multivariate process control. Typically, the principal component variables are selected to summarize the variation in the process data. We provide an analysis to select the principal component variables to be included in a multivariate control chart that incorporates the unique aspects of the process control problem (rather than using traditional principal component guidelines).

6.
A method for inducing a desired rank correlation matrix on multivariate input vectors for simulation studies has recently been developed by Iman and Conover (1982). The primary intention of this procedure is to produce correlated input variables for use with computer models. Since this procedure is distribution free and allows the exact marginal distributions to remain intact, it can be used with any marginal distributions for which it is reasonable to think in terms of correlation. In this paper we present a series of rank correlation plots based on this procedure when the marginal distributions are normal, lognormal, uniform and loguniform. These plots provide a convenient tool both for aiding the modeler in determining the degree of dependence among input variables (rather than guessing) and for communicating to the modeler the effect of different correlation assumptions. In addition, this procedure can be used with sample multivariate data by sampling directly from the respective marginal empirical distribution functions.
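The idea of inducing a rank correlation by reordering samples can be illustrated with a minimal two-variable sketch. It is a simplification and not the published Iman and Conover procedure: it draws random normal scores and makes no adjustment for their sample correlation. All function names here (`ranks`, `induce_rank_correlation`, `spearman`) are hypothetical and chosen only for illustration.

```python
import math
import random


def ranks(values):
    """Return 0-based ranks of values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r


def induce_rank_correlation(x, y, rho, rng):
    """Reorder x and y so that their rank correlation is roughly rho.

    Only the ordering changes, so each marginal sample is preserved exactly.
    """
    n = len(x)
    # Independent normal scores, then a correlated combination
    # (the 2x2 Cholesky factor of the target correlation matrix).
    s1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t = [rho * a + math.sqrt(1.0 - rho * rho) * b for a, b in zip(s1, s2)]
    xs, ys = sorted(x), sorted(y)
    # Give x the rank pattern of s1 and y the rank pattern of t.
    new_x = [xs[r] for r in ranks(s1)]
    new_y = [ys[r] for r in ranks(t)]
    return new_x, new_y


def spearman(a, b):
    """Spearman rank correlation, assuming no ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


rng = random.Random(0)
x = [rng.uniform(0.0, 1.0) for _ in range(2000)]         # uniform marginal
y = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # lognormal marginal
x2, y2 = induce_rank_correlation(x, y, 0.8, rng)
print(round(spearman(x2, y2), 3))
```

Because the samples are only permuted, `sorted(x2) == sorted(x)` holds, which is the sense in which the marginal distributions remain intact. The achieved rank correlation falls somewhat below the 0.8 used for the normal scores, since that value sets their Pearson correlation and not their rank correlation.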

7.
The analysis of high-dimensional data often begins with the identification of lower dimensional subspaces. Principal component analysis is a dimension reduction technique that identifies linear combinations of variables along which most variation occurs or which best “reconstruct” the original variables. For example, many temperature readings may be taken in a production process when in fact there are just a few underlying variables driving the process. A problem with principal components is that the linear combinations can seem quite arbitrary. To make them more interpretable, we introduce two classes of constraints. In the first, coefficients are constrained to equal a small number of values (homogeneity constraint). The second constraint attempts to set as many coefficients to zero as possible (sparsity constraint). The resultant interpretable directions are either calculated to be close to the original principal component directions, or calculated in a stepwise manner that may make the components more orthogonal. A small dataset on characteristics of cars is used to introduce the techniques. A more substantial data mining application is also given, illustrating the ability of the procedure to scale to a very large number of variables.
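To make the sparsity idea concrete, the sketch below applies plain hard thresholding to a loading vector and measures how close the result stays to the original direction. This is a swapped-in, much simpler technique than the constrained optimization the abstract describes; the loading values, the threshold, and the function names are assumptions made up for illustration.

```python
import math


def sparsify_loadings(loadings, threshold):
    """Zero out coefficients with |value| < threshold, then rescale to unit length."""
    kept = [v if abs(v) >= threshold else 0.0 for v in loadings]
    norm = math.sqrt(sum(v * v for v in kept))
    if norm == 0.0:
        raise ValueError("threshold removed every coefficient")
    return [v / norm for v in kept]


def cosine(u, v):
    """Cosine of the angle between two directions (1.0 means identical)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


# Hypothetical first-component loadings for five variables.
original = [0.70, 0.68, 0.15, -0.12, 0.05]
sparse = sparsify_loadings(original, threshold=0.2)
print(sparse)
print(cosine(original, sparse))
```

The cosine gives one possible reading of "close to the original principal component directions": the sparse direction uses only two variables yet remains nearly parallel to the original one.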

8.
Datasets are sometimes divided into distinct subsets, e.g. due to multi-center sampling, or to variations in instruments, questionnaire item ordering or mode of administration, and the data analyst then needs to assess whether a joint analysis is meaningful. The Principal Component Analysis-based Data Structure Comparisons (PCADSC) tools are three new non-parametric, visual diagnostic tools for investigating differences in structure for two subsets of a dataset through covariance matrix comparisons by use of principal component analysis. The PCADSC tools are demonstrated in a data example using European Social Survey data on psychological well-being in three countries, Denmark, Sweden, and Bulgaria. The data structures are found to be different in Denmark and Bulgaria, and thus a comparison of, for example, mean psychological well-being scores is not meaningful. However, when comparing Denmark and Sweden, very similar data structures, and thus comparable concepts of well-being, are found. Therefore, inter-country comparisons are warranted for these countries.

9.
Regression tends to give very unstable and unreliable regression weights when predictors are highly collinear. Several methods have been proposed to counter this problem. A subset of these do so by finding components that summarize the information in the predictors and the criterion variables. The present paper compares six such methods (two of which are almost completely new) to ordinary regression: partial least squares (PLS), principal component regression (PCR), principal covariates regression, reduced rank regression, and two variants of what is called power regression. The comparison is mainly done by means of a series of simulation studies, in which data are constructed in various ways, with different degrees of collinearity and noise, and the methods are compared in terms of their capability of recovering the population regression weights, as well as their prediction quality for the complete population. It turns out that recovery of regression weights in situations with collinearity is often very poor by all methods, unless the regression weights lie in the subspace spanning the first few principal components of the predictor variables. In those cases, PLS and PCR typically give the best recoveries of regression weights. The picture is inconclusive, however, because, especially in the study with more realistic simulated data, PLS and PCR gave the poorest recoveries of regression weights in conditions with relatively low noise and collinearity. It seems that PLS and PCR are particularly indicated in cases with much collinearity, whereas in other cases it is better to use ordinary regression. Prediction suffers far less from collinearity than recovery of the regression weights does.

10.
In this paper some hierarchical methods for identifying groups of variables are illustrated and compared. It is shown that the use of multivariate association measures between two sets of variables can overcome the drawbacks of the usually employed bivariate correlation coefficient, but the resulting methods are generally not monotonic. Thus a new multivariate association measure is proposed, based on the links existing between canonical correlation analysis and principal component analysis, which can be more suitably used for the purpose at hand. The hierarchical method based on the suggested measure is illustrated and compared with other possible solutions by analysing simulated and real data sets. Finally, an extension of the suggested method to the more general situation of mixed (qualitative and quantitative) variables is proposed and theoretically discussed.

11.
This work is concerned with robustness in Principal Component Analysis (PCA). The approach we adopt here is to replace the criterion of least squares by another criterion based on a convex and sufficiently differentiable loss function ρ. Using this criterion we propose a robust estimate of the location vector and introduce an orthogonality with respect to (w.r.t.) ρ in order to define the different steps of a PCA. The influence functions of a vector mean and principal vectors are developed in order to provide a method for obtaining a robust PCA. The practical procedure is based on an alternative-steps algorithm.

12.
In practice, when a principal component analysis is applied to a large number of variables, the resultant principal components may not be easy to interpret, as each principal component is a linear combination of all the original variables. Selection of a subset of variables that contains, in some sense, as much information as possible and enhances the interpretations of the first few covariance principal components is one possible approach to tackle this problem. This paper describes several variable selection criteria and investigates which criteria are best for this purpose. Although some criteria are shown to be better than others, the main message of this study is that it is unwise to rely on only one or two criteria. It is also clear that the interdependence between variables and the choice of how to measure closeness between the original components and those using subsets of variables are both important in determining the best criteria to use.

13.
Linear discriminant analysis between two populations is considered in this paper. Error rate is reviewed as a criterion for selection of variables, and a stepwise procedure is outlined that selects variables on the basis of empirical estimates of error. Problems with assessment of the selected variables are highlighted. A leave-one-out method is proposed for estimating the true error rate of the selected variables, or alternatively of the selection procedure itself. Monte Carlo simulations, of multivariate binary as well as multivariate normal data, demonstrate the feasibility of the proposed method and indicate its much greater accuracy relative to that of other available methods.
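Leave-one-out estimation of the error of a selection procedure can be sketched as follows: the variable selection is repeated inside every fold, so the held-out observation never influences which variable is chosen. This is an illustrative assumption-laden sketch, not the paper's method: it selects a single variable with a midpoint-of-means rule instead of running stepwise linear discriminant analysis, and all names and the simulated noise data are hypothetical.

```python
import random


def fit_rule(xs, labels, j):
    """Midpoint-of-means rule on variable j; returns (cut, sign)."""
    v0 = [x[j] for x, y in zip(xs, labels) if y == 0]
    v1 = [x[j] for x, y in zip(xs, labels) if y == 1]
    m0, m1 = sum(v0) / len(v0), sum(v1) / len(v1)
    return (m0 + m1) / 2.0, (1 if m1 >= m0 else -1)


def predict(rule, x, j):
    """Classify observation x using variable j and the fitted rule."""
    cut, sign = rule
    return 1 if sign * (x[j] - cut) > 0 else 0


def select_and_fit(xs, labels):
    """Pick the variable with the lowest apparent (resubstitution) error."""
    best = None
    for j in range(len(xs[0])):
        rule = fit_rule(xs, labels, j)
        err = sum(predict(rule, x, j) != y for x, y in zip(xs, labels)) / len(xs)
        if best is None or err < best[0]:
            best = (err, j, rule)
    return best  # (apparent error, selected variable index, rule)


def loo_error_of_procedure(xs, labels):
    """Leave-one-out error with the selection step redone in every fold."""
    wrong = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        _, j, rule = select_and_fit(train_x, train_y)
        wrong += predict(rule, xs[i], j) != labels[i]
    return wrong / len(xs)


# Pure-noise data: 30 observations, 20 variables, labels unrelated to the data.
rng = random.Random(1)
xs = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(30)]
labels = [i % 2 for i in range(30)]
apparent, chosen, _ = select_and_fit(xs, labels)
loo = loo_error_of_procedure(xs, labels)
print("apparent error:", apparent, "leave-one-out error:", loo)
```

On noise data like this, the apparent error of the selected variable tends to look optimistic, while the leave-one-out figure for the whole procedure is the more honest estimate; comparing the two printed values shows the assessment problem the abstract highlights.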

14.
Abstract

In this article we study the relationship between principal component analysis and a multivariate dependency measure. It is shown, via simulated examples and real data, that the information provided by principal components is compatible with that obtained via the dependency measure δ. Furthermore, we show that in some instances in which principal component analysis fails to give reasonable results due to nonlinearity among the random variables, the dependency statistic δ still provides good results. Finally, we give some ideas about using the statistic δ in order to reduce the dimensionality of a given data set.

15.
Given multivariate normal data and a certain spherically invariant prior distribution on the covariance matrix, it is desired to estimate the moments of the posterior marginal distributions of some scalar functions of the covariance matrix by importance sampling. To this end, a family of distributions is defined on the group of orthogonal matrices and a procedure is proposed for selecting one of these distributions for use as a weighting distribution in the importance sampling process. In an example, estimates are calculated for the posterior mean and variance of each element in the covariance matrix expressed in the original coordinates, for the posterior mean of each element in the correlation matrix expressed in the original coordinates, and for the posterior mean of each element in the covariance matrix expressed in the coordinates of the principal variables.

16.
We suggest a procedure to improve the overall performance of several existing methods for determining the number of factors in factor analysis by using alternative measures of correlation: Pearson's, Spearman's, Gini's, and a robust estimator of the covariance matrix (MCD). We examine the effect of the choice of the covariance used on the number of factors chosen by the KG rule of one, the 80% rule, the Minimum average partial (MAP), and the Parallel Analysis Methodology (PAM). Extensive simulations show that when all (or part) of the data come from heavy-tailed (lognormal) distributions, ranking the variables which come from nonsymmetric distributions improves the performance of the methods. In this case, Gini is slightly better than Spearman. The PAM and MAP procedures are qualitatively superior to the KG and the 80% rules in determining the true number of factors. A real example involving data on document authorship is analyzed.
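The effect of ranking skewed variables before computing a correlation can be seen in a small sketch. It computes Spearman's correlation as Pearson's correlation of average ranks and compares the two on a monotone but strongly nonlinear relationship. The data and function names are assumptions for illustration only, and the sketch does not cover the factor-number rules themselves.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(a, b):
    """Pearson product-moment correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# A monotone relationship with a heavy right tail in y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(v) for v in x]
print("Pearson:", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Pearson's coefficient is pulled well below 1 by the skewed values of `y`, whereas the rank-based coefficient reflects the perfect monotone association. A correlation matrix built this way could then be passed to whichever factor-number rule is in use.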

17.
The use of large-dimensional factor models in forecasting has received much attention in the literature, with the consensus being that forecasts can be improved relative to standard models. However, recent contributions in the literature have demonstrated that care needs to be taken when choosing which variables to include in the model. A number of different approaches to determining these variables have been put forward. These are, however, often based on ad hoc procedures or abandon the underlying theoretical factor model. In this article, we take a different approach to the problem by using the least absolute shrinkage and selection operator (LASSO) as a variable selection method to choose between the possible variables and thus obtain sparse loadings from which factors or diffusion indexes can be formed. This allows us to build a more parsimonious factor model that is better suited for forecasting compared to the traditional principal components (PC) approach. We provide an asymptotic analysis of the estimator and illustrate its merits empirically in a forecasting experiment based on U.S. macroeconomic data. Overall, we find that compared to PC we obtain improvements in forecasting accuracy and thus find it to be an important alternative to PC. Supplementary materials for this article are available online.

18.
The aim of this paper is to propose a theoretical “multi-phase” strategy for analysing in dynamic terms the territorial impact of agricultural and environmental EU policy measures. This approach should also make it possible to evaluate the adjustment capability of farms as a function of the characteristics of different territories. The proposed methodology is illustrated by an example using data relative to the 41 provinces of Northern Italy. In the first step, a multivariate statistical analysis (MSA) consisting of Principal Component Analysis and Cluster Analysis leads to the identification of homogeneous clusters of territorial units. The territorial mapping is conditional on a predetermined set of indicators that takes into account different aspects of agricultural development. In a second step, Positive Mathematical Programming (PMP) makes it possible to introduce the impact of agricultural policies (compensatory payments, price changes, etc.), returning different scenarios of land use and agricultural profitability. According to the outputs of the PMP, the third step consists of a new MSA for detecting any changes in the territorial mapping. Convergence analysis can then synthesise the impact of the different policy options.

19.
This article considers an approach to estimating and testing a new Kronecker product covariance structure for three-level (multiple time points (p), multiple sites (u), and multiple response variables (q)) multivariate data. Testing of such a covariance structure is potentially important for high dimensional multi-level multivariate data. The hypothesis testing procedure developed in this article can test not only the hypothesis for three-level multivariate data but also, as special cases, many different hypotheses for two-level multivariate data, such as blocked compound symmetry. The tests are implemented with two real data sets.

20.
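Using data-processing techniques from multivariate statistics, chiefly principal component analysis, comprehensive principal component analysis, and hierarchical clustering, this paper analyzes data from a survey of nearly 500 enterprises worldwide organized in 2004 by the global Continuous Innovation Network (CInet). A comparison of enterprises in Guizhou Province with those in other countries shows substantial differences in how continuous improvement capability is organized and operated. To identify the causes of these differences, comprehensive principal component analysis and hierarchical clustering are used to construct a group of target enterprises with strong capability in the organization and operation of continuous improvement. A comparative analysis of the factors making up organization and operation in Guizhou enterprises and in the target enterprises then identifies the problems that Guizhou enterprises face in organizing and operating continuous improvement, and corresponding recommendations and countermeasures are proposed. The selection of the target enterprises and the verification of their innovation capability, the method for filling in missing entries in the data tables, and the factor comparison method used in the analysis may also serve as a reference for the analysis of other large survey datasets.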
