Similar Articles

20 similar articles retrieved.
1.
In classical principal component analysis (PCA), the empirical influence function of the sensitivity coefficient ρ is used to detect observations that are influential on the subspace spanned by the dominant principal components. In this article, we derive the influence function of ρ in the case where the reweighted minimum covariance determinant (MCD1) is used as the estimator of multivariate location and scatter. Our aim is to confirm the robustness of the MCD1 through the approach based on the influence function of the sensitivity coefficient.

2.
Hotelling's T² statistic has many applications in multivariate analysis. In particular, it can be used to measure the influence that a particular observation vector has on parameter estimation. For example, in the bivariate case, there is a direct relationship between the ellipse generated using the T² statistic for individual observations and the hyperbolae generated using Hampel's influence function for the corresponding correlation coefficient. In this paper, we jointly use the components of an orthogonal decomposition of the T² statistic and some influence functions to identify outliers or influential observations. Since the conditional components of the T² statistic are related to possible changes in the correlation between a variable and a group of other variables, we consider the theoretical influence functions of the correlation and multiple correlation coefficients. Finite-sample versions of these influence functions are used to obtain the estimated influence values.
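As a rough illustration of the idea (not the authors' orthogonal decomposition), a per-observation Hotelling-type T² statistic can be computed as the squared Mahalanobis distance of each observation from the sample mean; unusually large values flag potentially influential points. The data and the artificial outlier below are purely illustrative.

```python
import numpy as np

def hotelling_t2(X):
    """Per-observation Hotelling-type T^2: squared Mahalanobis distance
    of each row from the sample mean, using the sample covariance."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    D = X - xbar
    return np.einsum("ij,jk,ik->i", D, S_inv, D)

# Illustration: flag observations with unusually large T^2 values.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=100)
X[0] = [4, -4]                      # an artificial influential point
t2 = hotelling_t2(X)
print(np.argsort(t2)[-3:])          # indices of the three largest T^2 values
```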

3.
In this study, classical and robust principal component analyses are used to evaluate the socioeconomic development of the regions served by Turkey's development agencies, which were established to reduce development differences among regions. Because development levels differ greatly across regions, outliers occur, so robust statistical methods are used; classical and robust methods are also used to check whether the data set contains outliers. In classical principal component analysis, the number of observations must exceed the number of variables; otherwise the determinant of the covariance matrix is zero. With ROBPCA, a robust approach to principal component analysis for high-dimensional data, principal components can be obtained even when the number of variables exceeds the number of observations. In this paper, the 26 development agencies are first evaluated on 19 variables using principal component analysis based on classical and robust scatter matrices, and then on 46 variables using the ROBPCA method.
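ROBPCA itself combines projection pursuit with the MCD estimator and is not available in scikit-learn; the sketch below only illustrates the simpler scatter-matrix route used in the first part of the study, plugging a reweighted-MCD covariance estimate into an ordinary eigendecomposition (so it requires more observations than variables). The data and dimensions are illustrative.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_pca_mcd(X, n_components=2, random_state=0):
    """PCA based on a robust (reweighted MCD) location/scatter estimate.
    Only valid when observations outnumber variables; ROBPCA itself
    would be needed for the high-dimensional case."""
    mcd = MinCovDet(random_state=random_state).fit(X)
    eigval, eigvec = np.linalg.eigh(mcd.covariance_)
    order = np.argsort(eigval)[::-1][:n_components]
    scores = (X - mcd.location_) @ eigvec[:, order]
    return scores, eigval[order], eigvec[:, order]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:5] += 8                              # a few outlying rows
scores, eigval, loadings = robust_pca_mcd(X)
```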

4.
Principal component analysis is a popular dimension reduction technique often used to visualize high-dimensional data structures. In genomics, this can involve millions of variables, but only tens to hundreds of observations. Theoretically, such extreme high dimensionality will cause biased or inconsistent eigenvector estimates, but in practice, the principal component scores are used for visualization with great success. In this paper, we explore when and why the classical principal component scores can be used to visualize structures in high-dimensional data, even when there are few observations compared with the number of variables. Our argument is twofold: First, we argue that eigenvectors related to pervasive signals will have eigenvalues scaling linearly with the number of variables. Second, we prove that for linearly increasing eigenvalues, the sample component scores will be scaled and rotated versions of the population scores, asymptotically. Thus, the visual information of the sample scores will be unchanged, even though the sample eigenvectors are biased. In the case of pervasive signals, the principal component scores can be used to visualize the population structures, even in extreme high-dimensional situations.
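A small simulation, under assumptions chosen here for illustration only, conveys the point: when a pervasive mean-shift signal loads on every variable, the leading sample score still separates the two groups even with far more variables than observations.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 10_000                       # far more variables than observations
groups = np.repeat([0, 1], n // 2)
signal = np.outer(groups - groups.mean(), rng.normal(size=p))  # pervasive mean shift
X = signal + rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)
# SVD of the centred data; scores = U * singular values
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = U[:, 0] * s[0]
print(np.corrcoef(pc1, groups)[0, 1])   # close to +/-1: the groups show up in PC1
```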

5.
Block-structured correlation matrices are correlation matrices in which the p variables are subdivided into homogeneous groups, with equal correlations for variables within each group and equal correlations between any given pair of variables from different groups. Block-structured correlation matrices arise as approximations to the true correlation matrices of certain data sets. A block structure in a correlation matrix entails a number of properties regarding its eigendecomposition and, therefore, a principal component analysis of the underlying data. This paper explores these properties, both from an algebraic and a geometric perspective, and discusses their robustness. Suggestions are also made regarding the choice of variables to be subjected to a principal component analysis in the presence of (approximately) block-structured variables.
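One such property is easy to verify numerically: within each group, contrast directions (coefficients summing to zero) are eigenvectors with eigenvalue equal to one minus the within-group correlation, so the spectrum contains repeated eigenvalues tied to the block sizes. The sketch below, with arbitrary group sizes and correlations, checks this.

```python
import numpy as np

def block_correlation(sizes, within, between):
    """Build a block-structured correlation matrix: equal correlation
    within[g] inside group g and a common correlation across groups."""
    p = sum(sizes)
    R = np.full((p, p), between, dtype=float)
    start = 0
    for size, w in zip(sizes, within):
        R[start:start + size, start:start + size] = w
        start += size
    np.fill_diagonal(R, 1.0)
    return R

R = block_correlation(sizes=[4, 3], within=[0.7, 0.5], between=0.2)
print(np.round(np.linalg.eigvalsh(R), 3))
# repeated eigenvalues 1 - 0.7 = 0.3 (multiplicity 3) and 1 - 0.5 = 0.5
# (multiplicity 2) correspond to within-group contrast components
```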

6.
The broken-stick (BS) rule is a popular stopping rule in ecology for determining the number of meaningful components in principal component analysis. However, its properties have not been systematically investigated. The purpose of the current study is to evaluate its ability to detect the correct dimensionality of a data set and whether it tends to over- or underestimate it. A Monte Carlo protocol was carried out. Two main correlation matrices deemed usual in practice were used, with three levels of correlation (0, 0.10 and 0.30) between components (generating oblique structures) and with different sample sizes. Analyses of the population correlation matrices indicated that, even for extremely large sample sizes, the BS method could be correct for only one of the six simulated structures. It actually failed to identify the correct dimensionality half the time with orthogonal structures and did even worse with some oblique ones. Under harder conditions, the results show that the power of the BS rule decreases as sample size increases, weakening its usefulness in practice. Since the BS method seems unlikely to identify the underlying dimensionality of the data, and given that better stopping rules exist, it appears to be a poor choice when carrying out principal component analysis.
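For reference, the usual formulation of the broken-stick rule retains component k as long as its share of the total variance exceeds the broken-stick expectation b_k = (1/p) * sum_{i=k}^{p} 1/i. A minimal sketch, with made-up eigenvalues, follows.

```python
import numpy as np

def broken_stick_rule(eigenvalues):
    """Number of leading components whose share of variance exceeds the
    broken-stick expectation b_k = (1/p) * sum_{i=k}^{p} 1/i."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    p = lam.size
    share = lam / lam.sum()
    bs = np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])
    keep = share > bs
    # retain leading components only up to the first failure
    return int(np.argmin(keep)) if not keep.all() else p

print(broken_stick_rule([3.0, 1.4, 0.3, 0.2, 0.1]))   # -> 2
```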

7.
This paper examines local influence assessment in generalized autoregressive conditional heteroscedasticity (GARCH) models with Gaussian and Student-t errors, where influence is examined via the likelihood displacement. Local influence is analysed under three perturbation schemes: data perturbation, innovative model perturbation and additive model perturbation. For each case, expressions for slope and curvature diagnostics are derived. Monte Carlo experiments are presented to determine the threshold values for locating influential observations. An empirical study of daily returns of the New York Stock Exchange composite index shows that local influence analysis is a useful technique for detecting influential observations; most of the observations detected as influential are associated with historical shocks in the market. Finally, based on this empirical study and the analysis of simulated data, some advice is given on how to use the discussed methodology.

8.
The influence of individual points in an ordinal logistic model is considered when the aim is to determine their effects on the predictive probability in a Bayesian predictive approach. Our concern is to study the effects produced when the data are slightly perturbed, in particular by observing how these perturbations will affect the predictive probabilities and consequently the classification of future cases. We consider the extent of the change in the predictive distribution when an individual point is omitted (deleted) from the sample, using a divergence measure suggested by Johnson (1985) as a measure of discrepancy between the full data and the data with the case deleted. The methodology is illustrated on data used in Titterington et al. (1981).

9.
Parallel analysis (Horn, 1965) and the minimum average partial correlation (MAP; Velicer, 1976) are widely regarded as optimal solutions for identifying the correct number of axes in principal component analysis. Previous results showed, however, that they become inefficient when variables belonging to different components are strongly correlated. Simulations are used to assess their power to detect the dimensionality of data sets with oblique structures. Overall, MAP had the best performance, being more powerful and accurate than parallel analysis when the component structure was modestly oblique. However, both stopping rules performed poorly in the presence of highly oblique factors.
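As a reminder of how one of these rules operates, here is a minimal sketch of Horn's parallel analysis: retain components whose correlation-matrix eigenvalues exceed a chosen quantile of eigenvalues obtained from uncorrelated normal data of the same dimensions. The simulated example and the 95% quantile are illustrative choices; MAP is not sketched here.

```python
import numpy as np

def parallel_analysis(X, n_sim=200, quantile=0.95, random_state=0):
    """Horn's parallel analysis: retain components whose correlation-matrix
    eigenvalue exceeds the chosen quantile of eigenvalues from uncorrelated
    normal data of the same dimensions."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sim = np.empty((n_sim, p))
    for b in range(n_sim):
        Z = rng.normal(size=(n, p))
        sim[b] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    threshold = np.quantile(sim, quantile, axis=0)
    keep = obs > threshold
    return int(np.argmin(keep)) if not keep.all() else p

rng = np.random.default_rng(3)
scores = rng.normal(size=(200, 2))                 # two underlying components
loadings = rng.normal(size=(2, 8))
X = scores @ loadings + rng.normal(size=(200, 8))
print(parallel_analysis(X))                        # typically 2
```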

10.
In this paper, we introduce the Liu estimator for the vector of parameters in linear measurement error models and discuss its asymptotic properties. Based on the Liu estimator, diagnostic measures are developed to identify influential observations. Additionally, analogues of Cook's distance and the likelihood distance are proposed to determine influential observations using a case-deletion approach. A parametric bootstrap procedure is used to obtain empirical distributions of the test statistics. Finally, the performance of the influence measures is illustrated through a simulation study and the analysis of a real data set.

11.
Zhou Jian, Zhang Min. Statistical Research (《统计研究》), 2014, 31(9): 37-43
Using frontier methods for macroeconomic data diagnostics and spatial panel models, this paper empirically analyses the strong-influence characteristics of provincial GDP for 30 Chinese provinces (excluding Tibet) over 1999-2011, and then applies a B-N decomposition to study the mechanism generating them. The main conclusions are as follows. Provincial GDP in China is not only spatially correlated but also exhibits significant strong-influence characteristics, and the degree to which each year's sample information affects the information of the whole sample period differs from year to year. Moreover, these strong-influence characteristics have a clearly identifiable formation mechanism: the stochastic component is their main source, while the cyclical component also has a considerable effect. These conclusions carry important policy implications for identifying strongly influential observations in provincial GDP, and provide practical guidance for the government in formulating and refining policies aimed at smoothing economic fluctuations and achieving steady and relatively rapid macroeconomic growth.

12.
In this study we are concerned with inference on the correlation parameter ρ of two Brownian motions, when only high-frequency observations from two one-dimensional continuous Itô semimartingales, driven by these particular Brownian motions, are available. Estimators for ρ are constructed in two situations: either when both components are observed (at the same time), or when only one component is observed and the other one represents its volatility process and thus has to be estimated from the data as well. In the first case it is shown that our estimator has the same asymptotic behaviour as the standard one for i.i.d. normal observations, whereas a feasible estimator can still be defined in the second framework, but with a slower rate of convergence.
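In the first setting (both components observed synchronously), the natural estimator is the realized correlation of the increments. The sketch below simulates two correlated Brownian motions with constant unit volatilities, a special case of the paper's semimartingale framework, purely to illustrate that estimator.

```python
import numpy as np

def realized_correlation(x, y):
    """Realized correlation of two discretely observed paths: sum of cross
    increment products over the square root of the product of realized
    variances."""
    dx, dy = np.diff(x), np.diff(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Simulate two correlated Brownian motions on [0, 1] with rho = 0.6.
rng = np.random.default_rng(4)
n, rho = 10_000, 0.6
dW1 = rng.normal(scale=np.sqrt(1 / n), size=n)
dW2 = rho * dW1 + np.sqrt(1 - rho**2) * rng.normal(scale=np.sqrt(1 / n), size=n)
X, Y = np.cumsum(dW1), np.cumsum(dW2)
print(realized_correlation(np.insert(X, 0, 0.0), np.insert(Y, 0, 0.0)))  # ~ 0.6
```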

13.
The effect of nonstationarity in the time series columns of input data on principal component analysis is examined. Nonstationarity is very common among economic indicators collected over time, which are subsequently summarized into fewer indices for monitoring purposes. Because nonstationary time series drift simultaneously, usually owing to a common trend, the first component averages all the variables without necessarily reducing dimensionality. Sparse principal component analysis can be used instead, but attainment of sparsity among the loadings (and hence dimension reduction) depends on the choice of the tuning parameters λ1,j. Simulated data with more variables than observations and with different patterns of cross-correlation and autocorrelation are used to illustrate the advantages of sparse principal component analysis over ordinary principal component analysis. Sparse component loadings for nonstationary time series data can be achieved provided that appropriate values of λ1,j are used. We provide the range of values of λ1,j that ensures convergence of the sparse principal component algorithm and consequently achieves sparsity of the component loadings.
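scikit-learn's SparsePCA exposes the L1 penalty through a single `alpha` shared across components, which plays a role analogous to the λ1,j above (the per-component tuning of the paper is not reproduced). The sketch below, on simulated random-walk data, simply shows how increasing `alpha` drives loadings to zero.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(5)
n, p = 60, 100                                       # more variables than observations
trend = np.cumsum(rng.normal(size=(n, 1)), axis=0)   # common stochastic trend
X = trend @ rng.normal(size=(1, p)) + np.cumsum(rng.normal(size=(n, p)), axis=0)
X = X - X.mean(axis=0)

for alpha in (0.1, 1.0, 5.0):
    spca = SparsePCA(n_components=3, alpha=alpha, random_state=0).fit(X)
    nonzero = np.count_nonzero(spca.components_, axis=1)
    print(alpha, nonzero)                            # larger alpha -> sparser loadings
```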

14.
Exact influence measures are applied in the evaluation of a principal component decomposition for high-dimensional data. Some data used for classifying samples of rice from their near infra-red transmission profiles, following a preliminary principal component analysis, are examined in detail. A normalization of eigenvalue influence statistics is proposed which ensures that measures reflect the relative orientations of observations, rather than their overall Euclidean distance from the sample mean. Thus, the analyst obtains more information from an analysis of eigenvalues than from approximate approaches to eigenvalue influence. This is particularly important for high-dimensional data, where a complete investigation of eigenvector perturbations may be cumbersome. The results are used to suggest a new class of influence measures based on ratios of Euclidean distances in orthogonal spaces.
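Exact (case-deletion) influence on the eigenvalues can always be obtained by brute force, as in the sketch below; the normalization and distance-ratio measures proposed in the paper are not reproduced here, and the data are simulated for illustration.

```python
import numpy as np

def eigenvalue_case_deletion(X):
    """Exact influence on covariance eigenvalues: change in each eigenvalue
    when observation i is deleted, computed by brute force."""
    X = np.asarray(X, dtype=float)
    full = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    influence = np.empty((X.shape[0], X.shape[1]))
    for i in range(X.shape[0]):
        reduced = np.cov(np.delete(X, i, axis=0), rowvar=False)
        influence[i] = full - np.sort(np.linalg.eigvalsh(reduced))[::-1]
    return influence

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
X[10] *= 6                                  # inflate one observation
infl = eigenvalue_case_deletion(X)
print(np.argmax(np.abs(infl[:, 0])))        # usually 10: the inflated observation
```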

15.
In geostatistics, detecting atypical observations is of special interest due to the changes they can cause in environmental and geological patterns. Several methods for detecting them have already been suggested for the univariate spatial case. However, the problem is more complicated when several variables are observed simultaneously and the spatial correlation among them must be taken into account. The aim of this paper is to detect outliers and influential observations in multivariate spatial linear models. For this purpose, we derive and explore two different methods. First, a multivariate version of the forward search algorithm is given, where locations with outliers are detected in the last steps of the procedure. Next, we derive influence measures to assess the impact of the observations on the multivariate spatial linear model. The procedures are easy to compute and to interpret by means of graphical representations. Finally, an example and a Monte Carlo study illustrate the performance of these methods for the identification of outliers in multivariate spatial linear models.

16.
We develop a new principal component analysis (PCA)-type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as the solution of an optimization problem whose criterion function is motivated by a penalized Bernoulli likelihood. A Majorization-Minimization (MM) algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and by a simulation study.
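A plausible form of the criterion described here, with θ_ij the logit of the success probability, a_i the PC scores, b_j the loadings and an L1 penalty inducing sparsity, is the following; the exact parameterization and penalty used in the paper may differ.

\[
\max_{\mu,\,A,\,B}\;\sum_{i=1}^{n}\sum_{j=1}^{p}\Bigl[x_{ij}\,\theta_{ij}-\log\bigl(1+e^{\theta_{ij}}\bigr)\Bigr]\;-\;\lambda\sum_{j=1}^{p}\sum_{k=1}^{K}\lvert b_{jk}\rvert,
\qquad \theta_{ij}=\mu_j+\mathbf{a}_i^{\top}\mathbf{b}_j,\quad x_{ij}\in\{0,1\}.
\]

An MM algorithm would replace the logistic log-likelihood with a quadratic majorizer at each iteration, turning each update into a penalized least-squares problem.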

17.
A new method for constructing interpretable principal components is proposed. The method first clusters the variables, and then interpretable (sparse) components are constructed from the correlation matrices of the clustered variables. For the first step of the method, a new weighted-variances method for clustering variables is proposed; the weighting reflects the requirement that the interpretable components should maximize the explained variance and thus provide sparse dimension reduction. An important feature of the new clustering procedure is that the optimal number of clusters (and components) can be determined in a non-subjective manner. The new method is illustrated using well-known simulated and real data sets. It clearly outperforms many existing methods for sparse principal component analysis in terms of both explained variance and sparseness.

18.
In influence analysis for principal components, several problems arise when different sample versions are applied: the difficulty of establishing a correspondence between the eigenvalues before and after the deletion of observations, the choice of sign for the eigenvectors, and the computational burden of solving a large number of eigenproblems. In this article, such problems are discussed from the joint-influence point of view, and a solution based on approximations is proposed. Furthermore, influence on a new parameter of interest is introduced: the proportion of variance explained by a set of principal components.

19.
Canonical variate analysis can be viewed as a two-stage principal component analysis. Explicit consideration of the principal components from the first stage, formalized in the context of shrunken estimators, leads to a number of practical advantages. In morphometric studies, the first eigenvector is often a size vector, with the remaining vectors contrast or shape-type vectors, so that a decomposition of the canonical variates into size and shape components can be achieved. In applied studies, a small number of the principal components often effect most of the separation between groups; plots of group means and associated concentration ellipses (ideally these should be circular) for the important principal components facilitate graphical inspection. Of considerable practical importance is the potential for improved stability of the estimated canonical vectors. When the between-groups sum of squares for a particular principal component is small, and the corresponding eigenvalue of the within-groups correlation matrix is also small, marked instability of the canonical vectors can be expected. The introduction of shrunken estimators, obtained by adding shrinkage constants to the eigenvalues, leads to more stable coefficients.
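A bare-bones sketch of the two-stage view (without the shrinkage step that is the paper's main contribution): first sphere the data with the pooled within-groups covariance, then take principal components of the group means in the whitened space. The grouping and data below are illustrative.

```python
import numpy as np

def canonical_variates(X, labels, n_components=2):
    """Two-stage view of canonical variate analysis: whiten with the pooled
    within-groups covariance, then take principal components of the group
    means in the whitened space (no shrinkage applied here)."""
    X = np.asarray(X, dtype=float)
    groups = np.unique(labels)
    means = np.array([X[labels == g].mean(axis=0) for g in groups])
    W = sum(np.cov(X[labels == g], rowvar=False) * (np.sum(labels == g) - 1)
            for g in groups) / (X.shape[0] - len(groups))
    evals, evecs = np.linalg.eigh(W)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T     # stage one: sphere W
    M = (means - means.mean(axis=0)) @ whitener
    _, _, Vt = np.linalg.svd(M, full_matrices=False)        # stage two: PCA of means
    directions = whitener @ Vt[:n_components].T             # canonical vectors
    return X @ directions

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=m, size=(30, 3)) for m in ([0, 0, 0], [2, 1, 0], [0, 2, 1])])
labels = np.repeat([0, 1, 2], 30)
scores = canonical_variates(X, labels)
```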

20.
The analysis of high-dimensional data often begins with the identification of lower dimensional subspaces. Principal component analysis is a dimension reduction technique that identifies linear combinations of variables along which most variation occurs or which best “reconstruct” the original variables. For example, many temperature readings may be taken in a production process when in fact there are just a few underlying variables driving the process. A problem with principal components is that the linear combinations can seem quite arbitrary. To make them more interpretable, we introduce two classes of constraints. In the first, coefficients are constrained to equal a small number of values (homogeneity constraint). The second constraint attempts to set as many coefficients to zero as possible (sparsity constraint). The resultant interpretable directions are either calculated to be close to the original principal component directions, or calculated in a stepwise manner that may make the components more orthogonal. A small dataset on characteristics of cars is used to introduce the techniques. A more substantial data mining application is also given, illustrating the ability of the procedure to scale to a very large number of variables.
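The toy functions below are not the authors' procedures; they merely illustrate the two kinds of constraint applied post hoc to a single loading vector: snapping coefficients to a small set of values (homogeneity) and zeroing small coefficients (sparsity). The example loading vector, the allowed levels and the threshold are made up.

```python
import numpy as np

def homogenize(loading, levels=(-1.0, 0.0, 1.0)):
    """Snap each coefficient to the nearest allowed value, then renormalize."""
    snapped = np.array([min(levels, key=lambda v: abs(v - c)) for c in loading])
    norm = np.linalg.norm(snapped)
    return snapped / norm if norm > 0 else snapped

def sparsify(loading, threshold=0.2):
    """Zero out small coefficients, then renormalize."""
    trimmed = np.where(np.abs(loading) >= threshold, loading, 0.0)
    norm = np.linalg.norm(trimmed)
    return trimmed / norm if norm > 0 else trimmed

v = np.array([0.55, 0.52, 0.48, 0.12, -0.08, -0.41])   # an ordinary PC loading vector
print(homogenize(v / np.linalg.norm(v)))
print(sparsify(v))
```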
