Similar Documents
1.
Differential analysis techniques are commonly used to offer scientists a dimension reduction procedure and an interpretable gateway to variable selection, especially when confronting high-dimensional genomic data. Huang et al. used a gene expression profile of breast cancer cell lines to identify genomic markers that are highly correlated with the in vitro sensitivity of the drug Dasatinib. They considered three statistical methods to identify differentially expressed genes and ultimately used the intersection of their results. However, the statistical methods used in that paper are not sufficient to select the genomic markers. In this paper we used three alternative statistical methods to select a combined list of genomic markers and compared it with the genes proposed by Huang et al. We then proposed using sparse principal component analysis (sparse PCA) to identify a final list of genomic markers. Sparse PCA takes the correlation among the genes into account and supports successful genomic marker discovery. We present a new, small set of genomic markers that effectively separates the groups of patients who are sensitive to Dasatinib. The analysis procedure should also help scientists identify genomic markers that separate two groups.
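For illustration, here is a minimal sketch of a sparse-PCA marker selection step, not the authors' pipeline: the expression matrix, number of components, penalty value, and thresholding rule are all illustrative assumptions.

```python
# A minimal sketch (not the authors' pipeline) of selecting markers from the
# non-zero loadings of sparse principal components; sizes and the penalty
# are illustrative assumptions.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 500))            # 50 cell lines x 500 genes (synthetic stand-in)

spca = SparsePCA(n_components=2, alpha=2.0, random_state=0)
scores = spca.fit_transform(expr)            # component scores per cell line

# Genes with a non-zero loading on at least one sparse component form the
# candidate marker list; correlation among genes is handled by the joint fit.
marker_idx = np.flatnonzero(np.any(spca.components_ != 0, axis=0))
print(f"{marker_idx.size} candidate markers selected")
```

In practice the sensitive and resistant cell lines could then be compared on the component scores rather than on individual genes.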

2.
Statistics, as one of the applied sciences, has great impact across a vast range of other sciences. Prediction of protein structures, with strong emphasis on their geometrical features described by dihedral angles, has motivated the branch of statistics known as directional statistics. One of the available biological prediction techniques is molecular dynamics simulation, which produces high-dimensional molecular structure data. Hence, it is expected that principal component analysis (PCA) can address some of the related statistical problems, particularly reducing the dimension of the variables involved. Since dihedral angles are variables on a non-Euclidean space (their locus is the torus), direct application of PCA is not expected to be very informative in this case. Principal geodesic analysis is one of the recent methods for reducing dimension in the non-Euclidean case. A procedure for using this technique to reduce the dimension of a set of dihedral angles is highlighted in this paper. We further propose an extension of this tool, implemented in such a way that the torus is approximated by the product of two unit circles, and evaluate its application in studying a real data set. A comparison of this technique with some previous methods is also undertaken.
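As a rough illustration of the "product of two unit circles" idea, the sketch below embeds each dihedral angle on the unit circle before running ordinary PCA; this is a crude embedding workaround, not the paper's principal geodesic analysis, and the von Mises angles are synthetic assumptions.

```python
# A rough sketch, not the paper's principal geodesic analysis: a common
# workaround for toroidal (angular) data is to embed each dihedral angle on
# the unit circle before applying ordinary PCA. The synthetic angles below
# are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
phi = rng.vonmises(mu=-1.0, kappa=4.0, size=200)   # backbone dihedral phi
psi = rng.vonmises(mu=2.5, kappa=4.0, size=200)    # backbone dihedral psi

# Map the torus (phi, psi) into R^4 via the product of two unit circles.
X = np.column_stack([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)])

pca = PCA(n_components=2).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```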

3.
Most linear statistical methods deal with data lying in a Euclidean space. However, there are many examples, such as DNA molecule topological structures, in which the initial or the transformed data lie in a non-Euclidean space. To obtain a measure of variability in these situations, principal component analysis (PCA) is usually performed on a Euclidean tangent space, as it cannot be directly implemented on a non-Euclidean space. Principal geodesic analysis (PGA), in contrast, is a newer tool that provides a measure of variability for nonlinear statistics. In this paper, the performance of this new tool is compared with that of PCA using a real data set representing a DNA molecular structure. It is shown that, due to the nonlinearity of the space, PGA explains more of the variability of the data than PCA.

4.
Sparse clustered data arise in finely stratified genetic and epidemiologic studies and pose at least two challenges to inference. First, it is difficult to model and interpret the full joint probability of dependent discrete data, which limits the utility of full likelihood methods. Second, standard methods for clustered data, such as pairwise likelihood and the generalized estimating function approach, are unsuitable when the data are sparse owing to the presence of many nuisance parameters. We present a composite conditional likelihood for use with sparse clustered data that provides valid inferences about covariate effects on both the marginal response probabilities and the intracluster pairwise association. Our primary focus is on sparse clustered binary data, in which case the proposed method utilizes doubly discordant quadruplets drawn from each stratum to conduct inference about the intracluster pairwise odds ratios.

5.
Analyzing incomplete data to infer the structure of gene regulatory networks (GRNs) is a challenging task in bioinformatics. Bayesian networks can be used successfully in this field. k-nearest neighbor imputation, singular value decomposition (SVD)-based imputation, and multiple imputation by chained equations are three fundamental methods for dealing with missing values. The path consistency (PC) algorithm based on conditional mutual information (PCA–CMI) is a well-known algorithm for inferring GRNs. This algorithm requires the data set to be complete. However, PCA–CMI is not a stable algorithm: when applied to permuted gene orders, it produces different networks. We propose an order-independent algorithm, PCA–CMI–OI, for inferring GRNs. After imputation of the missing data, the performances of PCA–CMI and PCA–CMI–OI are compared. Results show that networks constructed from data imputed by the SVD-based method and the PCA–CMI–OI algorithm outperform the other imputation methods and PCA–CMI. PC-based algorithms yield an undirected or partially directed network. The mutual information test (MIT) score, which can deal with discrete data, is one of the well-known methods for directing the edges of the resulting networks. We also propose a new score, ConMIT, which is appropriate for analyzing continuous data. Results show that the precision of directing the edges of the skeleton is improved by applying the ConMIT score.
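The sketch below shows one plausible form of the SVD-based imputation step that the abstract reports as performing best; the rank, tolerance, and synthetic data are illustrative assumptions rather than the authors' exact procedure, and the subsequent network-inference step is omitted.

```python
# A plausible SVD-based imputation step (in the spirit of the method the
# abstract reports as best performing); rank, tolerance, and data are
# illustrative assumptions, and the GRN inference step is omitted.
import numpy as np

def svd_impute(X, rank=3, n_iter=100, tol=1e-6):
    """Iteratively refill missing entries with a low-rank SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)   # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        new = np.where(missing, approx, X)                  # keep observed values fixed
        if np.linalg.norm(new - filled) < tol * np.linalg.norm(filled):
            return new
        filled = new
    return filled

rng = np.random.default_rng(2)
expr = rng.normal(size=(30, 20))                            # genes x samples (synthetic)
expr[rng.random(expr.shape) < 0.1] = np.nan                 # 10% missing at random
completed = svd_impute(expr, rank=3)
```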

6.
This paper reviews various treatments of non-metric variables in partial least squares (PLS) and principal component analysis (PCA) algorithms. The performance of the different treatments is compared in an extensive simulation study under several typical data generating processes, and associated recommendations are made. Moreover, we find that PLS-based methods are preferable in practice, since, independent of the data generating process, PLS performs either as well as PCA or significantly outperforms it. As an application of PLS and PCA algorithms with non-metric variables we consider the construction of a wealth index to predict household expenditures. Consistent with our simulation study, we find that a PLS-based wealth index with dummy coding outperforms PCA-based ones.
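A small sketch of a dummy-coded PLS wealth index of the kind the abstract describes; the asset variables, the expenditure outcome, and the data are illustrative assumptions.

```python
# A small sketch of a dummy-coded PLS wealth index; asset variables,
# expenditure outcome, and data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
assets = pd.DataFrame({
    "roof": rng.choice(["thatch", "tin", "tile"], size=300),
    "water": rng.choice(["well", "piped"], size=300),
    "fridge": rng.choice(["no", "yes"], size=300),
})
expenditure = rng.lognormal(mean=6.0, sigma=0.5, size=300)

X = pd.get_dummies(assets, drop_first=True).to_numpy(dtype=float)  # dummy coding
pls = PLSRegression(n_components=1).fit(X, expenditure)
wealth_index = pls.transform(X).ravel()        # first PLS component as the index
```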

7.
Principal component analysis (PCA) is a widely used statistical technique for determining subscales in questionnaire data. As in any other statistical technique, missing data may complicate both its execution and its interpretation. In this study, six methods for dealing with missing data in the context of PCA are reviewed and compared: listwise deletion (LD), pairwise deletion, the missing data passive approach, regularized PCA, the expectation-maximization algorithm, and multiple imputation. Simulations show that, except for LD, all methods give about equally good results for realistic percentages of missing data. Therefore, the choice of a procedure can be based on ease of application or simply on the availability of a technique.
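For concreteness, the sketch below contrasts two of the reviewed treatments on the same data: listwise deletion versus a chained-equations-style imputation before PCA (a single imputation run rather than full multiple imputation); the data and settings are illustrative assumptions.

```python
# A brief sketch comparing listwise deletion with a single chained-equations-
# style imputation before PCA; data and settings are illustrative assumptions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
X[rng.random(X.shape) < 0.05] = np.nan             # 5% missing at random

complete_rows = X[~np.isnan(X).any(axis=1)]        # listwise deletion
pca_ld = PCA(n_components=2).fit(complete_rows)

X_imp = IterativeImputer(random_state=0).fit_transform(X)   # chained-equations imputation
pca_mi = PCA(n_components=2).fit(X_imp)
```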

8.
Projection techniques for nonlinear principal component analysis
Principal Components Analysis (PCA) is traditionally a linear technique for projecting multidimensional data onto lower dimensional subspaces with minimal loss of variance. However, there are several applications where the data lie in a lower dimensional subspace that is not linear; in these cases linear PCA is not the optimal method to recover this subspace and thus account for the largest proportion of variance in the data. Nonlinear PCA addresses the nonlinearity problem by relaxing the linear restrictions on standard PCA. We investigate both linear and nonlinear approaches to PCA, both separately and in combination. In particular we introduce a combination of projection pursuit and nonlinear regression for nonlinear PCA. We compare the success of PCA techniques in variance recovery by applying linear, nonlinear and hybrid methods to some simulated and real data sets. We show that the best linear projection that captures the structure in the data (in the sense that the original data can be reconstructed from the projection) is not necessarily a (linear) principal component. We also show that the ability of certain nonlinear projections to capture data structure is affected by the choice of constraint in the eigendecomposition of a nonlinear transform of the data. Similar success in recovering data structure was observed for both linear and nonlinear projections.
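The sketch below is not the projection-pursuit approach of the paper; it uses kernel PCA simply as a readily available nonlinear PCA baseline on data lying near a curved one-dimensional subspace, with synthetic data as an assumption.

```python
# Not the paper's projection-pursuit method: kernel PCA is shown only as a
# readily available nonlinear PCA baseline on data near a curved 1-D subspace.
# The synthetic data and kernel settings are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(6)
t = rng.uniform(-1, 1, size=300)
X = np.column_stack([t, t**2]) + rng.normal(scale=0.05, size=(300, 2))  # noisy parabola

lin = PCA(n_components=1).fit(X)
print("linear PCA variance captured:", lin.explained_variance_ratio_[0])

kpca = KernelPCA(n_components=1, kernel="rbf", gamma=2.0, fit_inverse_transform=True)
Z = kpca.fit_transform(X)
X_rec = kpca.inverse_transform(Z)               # reconstruction from the 1-D projection
print("kernel PCA reconstruction error:", np.mean((X - X_rec) ** 2))
```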

9.
Principal component analysis (PCA) is widely used to analyze high-dimensional data, but it is very sensitive to outliers. Robust PCA methods seek fits that are unaffected by the outliers and can therefore be trusted to reveal them. FastHCS (high-dimensional congruent subsets) is a robust PCA algorithm suitable for high-dimensional applications, including cases where the number of variables exceeds the number of observations. After detailing the FastHCS algorithm, we carry out an extensive simulation study and three real data applications, the results of which show that FastHCS is systematically more robust to outliers than state-of-the-art methods.

10.
This paper investigates several techniques to discriminate between two multivariate stationary signals. The methods considered include Gaussian likelihood ratio tests for variance equality, a chi-squared time-domain test, and a spectral-based test. The latter two tests assess equality of the multivariate autocovariance functions of the two signals over many different lags. The Gaussian likelihood ratio test is perhaps best viewed as a principal component analysis (PCA) without the dimension reduction aspects; it can be modified to consider covariance features other than variances via dimension augmentation tactics. A simulation study shows how one can draw inappropriate conclusions with PCA tests, even when dimension augmentation techniques are used to incorporate non-zero lag autocovariances into the analysis. The various discrimination methods are first discussed, and the simulation study then illuminates their properties. In this pursuit, calculations are needed to identify several multivariate time series models with specific autocovariance properties. To demonstrate the applicability of the methods, nine US and Canadian weather stations from three distinct regions are clustered. Here, spectral clustering perfectly identified the distinct regions, the chi-squared test performed marginally, and the PCA/likelihood ratio method did not perform well.

11.
Technical advances in many areas have produced more complicated high-dimensional data sets than the usual high-dimensional data matrix, such as fMRI data collected over a period for independent trials, or expression levels of genes measured in different tissues. Multiple measurements exist for each variable in each sample unit of these data. Regarding the multiple measurements as an element of a Hilbert space, we propose principal component analysis (PCA) in Hilbert space. The principal components (PCs) thus defined carry information not only about the patterns of variation in individual variables but also about the relationships between variables. To extract the features with the greatest contributions to the variation explained by the PCs for high-dimensional data, we also propose sparse PCA in Hilbert space by imposing a generalized elastic-net constraint. Efficient algorithms to solve the optimization problems in our methods are provided. We also propose a criterion for selecting the tuning parameter.

12.
While studying the results from one European Parliament election, the question of the suitability of principal component analysis (PCA) for this kind of data was raised. Since multiparty data should be seen as compositional data (CD), the application of PCA is inadvisable and may lead to unreliable results. This work points out the limitations of applying PCA to CD and presents a practical application to the results of the 2004 European Parliament election. We present a comparative study of the results of PCA, crude PCA and logcontrast PCA (Aitchison in Biometrika 70:57–61, 1983; Kucera and Malmgren in Marine Micropaleontology 34:117–120, 1998). For the data set considered, the approach that produced the clearest results was the logcontrast PCA. Moreover, crude PCA led to misleading results, since nonlinear relations were present between variables, and linear PCA proved, once again, to be inappropriate for analysing data that should be seen as CD.
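A minimal sketch of the contrast between "crude" PCA on raw compositions and a logcontrast-style analysis via the centred log-ratio (clr) transform; the synthetic vote shares are an illustrative assumption.

```python
# A minimal sketch: crude PCA on raw compositional shares versus a
# logcontrast-style PCA after a centred log-ratio (clr) transform.
# The synthetic vote shares are an illustrative assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
shares = rng.dirichlet(alpha=[8, 5, 3, 2], size=100)       # rows sum to 1 (compositional)

def clr(X):
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)          # centred log-ratio transform

pca_crude = PCA(n_components=2).fit(shares)                 # crude PCA on raw shares
pca_clr = PCA(n_components=2).fit(clr(shares))              # logcontrast-style PCA
```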

13.
Treating principal component analysis (PCA) and canonical variate analysis (CVA) as methods for approximating tables, we develop measures, collectively termed predictivity, that assess the quality of fit independently for each variable and for all dimensionalities. We illustrate their use with data from aircraft development, the African timber industry and copper froth measurements from the mining industry. Similar measures are described for assessing the predictivity associated with the individual samples (in the case of PCA and CVA) or group means (in the case of CVA). For these measures to be meaningful, certain essential orthogonality conditions must hold that are shown to be satisfied by predictivity.

14.
In statistical practice, rectangular tables of numeric data are commonplace, and are often analyzed using dimension-reduction methods like the singular value decomposition and its close cousin, principal component analysis (PCA). This analysis produces score and loading matrices representing the rows and the columns of the original table, and these matrices may be used both for prediction purposes and to gain structural understanding of the data. In some tables, the data entries are necessarily nonnegative (apart, perhaps, from some small random noise), and so the matrix factors meant to represent them should arguably also contain only nonnegative elements. This thinking, and the desire for parsimony, underlies such techniques as rotating factors in a search for "simple structure." These attempts to transform score or loading matrices of mixed sign into nonnegative, parsimonious forms are, however, indirect and at best imperfect. The recent development of nonnegative matrix factorization, or NMF, is an attractive alternative. Rather than attempt to transform a loading or score matrix of mixed signs into one with only nonnegative elements, it directly seeks matrix factors containing only nonnegative elements. The resulting factorization often leads to substantial improvements in interpretability of the factors. We illustrate this potential by synthetic examples and a real dataset. The question of exactly when NMF is effective is not fully resolved, but some indicators of its domain of success are given. It is pointed out that the NMF factors can be used in much the same way as those coming from PCA for such tasks as ordination, clustering, and prediction. Supplementary materials for this article are available online.
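A short sketch contrasting mixed-sign PCA factors with a nonnegative matrix factorization of the same nonnegative table; the rank and synthetic data are illustrative assumptions.

```python
# A short sketch: PCA yields mixed-sign scores/loadings, while NMF constrains
# both factors to be nonnegative. Rank and synthetic data are illustrative
# assumptions.
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(8)
W_true = rng.uniform(size=(60, 3))
H_true = rng.uniform(size=(3, 20))
X = W_true @ H_true + rng.uniform(0, 0.01, size=(60, 20))   # nonnegative table + noise

scores = PCA(n_components=3).fit_transform(X)               # mixed-sign factors
nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                                    # nonnegative scores
H = nmf.components_                                         # nonnegative loadings
```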

15.
A crucial issue for principal components analysis (PCA) is to determine the number of principal components needed to capture the variability of typically high-dimensional data. In this article, dimension detection for PCA is formulated as a variable selection problem for regressions. The adaptive LASSO is used for the variable selection in this application. Simulations demonstrate that this approach is more accurate than existing methods in some cases and competitive in others. The performance of the model is also illustrated using a real example.

16.
We present a new class of methods for high dimensional non-parametric regression and classification called sparse additive models. Our methods combine ideas from sparse linear modelling and additive non-parametric regression. We derive an algorithm for fitting the models that is practical and effective even when the number of covariates is larger than the sample size. Sparse additive models are essentially a functional version of the grouped lasso of Yuan and Lin. They are also closely related to the COSSO model of Lin and Zhang but decouple smoothing and sparsity, enabling the use of arbitrary non-parametric smoothers. We give an analysis of the theoretical properties of sparse additive models and present empirical results on synthetic and real data, showing that they can be effective in fitting sparse non-parametric models in high dimensional data.

17.
Many applications of statistical methods for spatially correlated data require the researcher to specify the correlation structure of the data. This can be a difficult task, as there are many candidate structures. Some spatial correlation structures depend on the distance between the observed data points while others rely on neighborhood structures. In this paper, Bayesian methods that systematically determine the 'best' correlation structure from a predefined class of structures are proposed. Bayes factors, highest probability models, and Bayesian model averaging are employed to determine the 'best' correlation structure and to average across these structures to create a non-parametric alternative structure for a loblolly pine data set with known tree coordinates. Tree diameters and heights were measured, and the spatial dependence between the trees was investigated. Results showed that the most probable model for the spatial correlation structure agreed with allometric trends for loblolly pine. A combination of the Matérn, simultaneous autoregressive, and conditional autoregressive models best described the inter-tree competition among the loblolly pine trees considered in this research.

18.
We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem with a criterion function motivated by penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.

19.
The effect of nonstationarity in the time series columns of input data on principal components analysis is examined. Nonstationarity is very common among economic indicators collected over time, which are subsequently summarized into fewer indices for monitoring purposes. Due to the simultaneous drifting of the nonstationary time series, usually caused by the trend, the first component averages all the variables without necessarily reducing dimensionality. Sparse principal components analysis can be used instead, but the attainment of sparsity among the loadings (and hence the achievement of dimension reduction) is influenced by the choice of the parameters λ1,j. Simulated data with more variables than observations and with different patterns of cross-correlations and autocorrelations were used to illustrate the advantages of sparse principal components analysis over ordinary principal components analysis. Sparse component loadings for nonstationary time series data can be achieved provided that appropriate values of λ1,j are used. We provide the range of values of λ1,j that will ensure convergence of the sparse principal components algorithm and consequently achieve sparsity of the component loadings.
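The sketch below illustrates the phenomenon described above on synthetic trending series: the ordinary first PC loads on every variable, while sparse PCA with a sufficiently large penalty (standing in for λ1,j) can zero out loadings. The penalty value and data are illustrative assumptions, and the degree of sparsity obtained depends on the penalty chosen.

```python
# Illustrative sketch: with a common drift, ordinary PCA's first component
# loads on all series, while sparse PCA (penalty standing in for lambda_{1,j})
# can zero out loadings. Penalty and data are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(9)
t = np.arange(120)
X = np.column_stack([0.05 * t + rng.normal(size=120) for _ in range(15)])  # common drift

pc1 = PCA(n_components=1).fit(X).components_[0]
print("nonzero loadings, ordinary PCA:", np.sum(np.abs(pc1) > 1e-8))

spc1 = SparsePCA(n_components=1, alpha=5.0, random_state=0).fit(X).components_[0]
print("nonzero loadings, sparse PCA:  ", np.sum(np.abs(spc1) > 1e-8))
```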

20.
This paper presents a novel ensemble classifier generation method that integrates the ideas of bootstrap aggregation and principal component analysis (PCA). To create each individual member of the ensemble, PCA is applied to every out-of-bag sample and the computed coefficients of all principal components are stored; the principal components calculated on the corresponding bootstrap sample are then taken as additional elements of the original feature set. A classifier is trained with the bootstrap sample and some features randomly selected from the new feature set. The final ensemble classifier is constructed by majority voting of the trained base classifiers. Results obtained from empirical experiments and statistical tests demonstrate that the proposed method performs better than or as well as several other ensemble methods on some benchmark data sets publicly available from the UCI repository. Furthermore, the diversity-accuracy patterns of the ensemble classifiers are investigated using kappa-error diagrams.
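A simplified sketch of the construction described above (bootstrap samples, PCA-derived features appended to the originals, random feature subsets per member, majority voting); it is not the authors' exact algorithm (in particular, PCA here is fitted on the bootstrap sample rather than the out-of-bag sample), and all parameter choices are illustrative assumptions.

```python
# A simplified sketch of the bagging + PCA-feature ensemble described above;
# not the authors' exact algorithm (PCA is fitted on the bootstrap sample here,
# not the out-of-bag sample), and all parameters are illustrative assumptions.
# Class labels y are assumed to be integer coded (0, 1, ...).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_pca_bagging(X, y, n_members=25, n_pcs=3, feat_frac=0.7, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    members = []
    for _ in range(n_members):
        boot = rng.integers(0, n, size=n)                     # bootstrap sample
        pca = PCA(n_components=n_pcs).fit(X[boot])
        Xa = np.hstack([X[boot], pca.transform(X[boot])])     # original + PC features
        feats = rng.choice(Xa.shape[1], size=int(feat_frac * Xa.shape[1]), replace=False)
        clf = DecisionTreeClassifier(random_state=0).fit(Xa[:, feats], y[boot])
        members.append((pca, feats, clf))
    return members

def predict_pca_bagging(members, X):
    votes = np.array([clf.predict(np.hstack([X, pca.transform(X)])[:, feats])
                      for pca, feats, clf in members])
    # Majority vote across the ensemble members, column by column.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```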
