Similar Literature
20 similar documents found (search time: 31 ms)
1.
Principal component analysis (PCA) is a widely used statistical technique for determining subscales in questionnaire data. As in any other statistical technique, missing data may complicate both its execution and its interpretation. In this study, six methods for dealing with missing data in the context of PCA are reviewed and compared: listwise deletion (LD), pairwise deletion, the missing data passive approach, regularized PCA, the expectation-maximization algorithm, and multiple imputation. Simulations show that, except for LD, all methods give about equally good results for realistic percentages of missing data. The choice of a procedure can therefore be based on ease of application or simply on the availability of a technique.
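As a concrete illustration of two of the reviewed strategies, the sketch below (Python/NumPy on synthetic data; not the authors' code, and the function names are hypothetical) contrasts listwise deletion with a simple EM-style iterative imputation for PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy questionnaire data: 100 respondents, 6 items, ~10% missing completely at random.
X = rng.normal(size=(100, 6))
X[rng.random(X.shape) < 0.1] = np.nan

def pca_listwise(X, k=2):
    """Listwise deletion: drop every row with a missing value, then PCA via SVD."""
    Xc = X[~np.isnan(X).any(axis=1)]
    Xc = Xc - Xc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                                  # k loading vectors

def pca_em_impute(X, k=2, n_iter=50):
    """EM-style imputation: alternate between a rank-k PCA fit and refilling
    the missing cells with their fitted values (mean-imputed start)."""
    Xf = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        Xhat = mu + (U[:, :k] * s[:k]) @ Vt[:k]    # rank-k reconstruction
        Xf = np.where(np.isnan(X), Xhat, X)        # keep observed cells as-is
    return Xf
```

Listwise deletion discards every partially observed row, whereas the iterative fill keeps all respondents, which is why the two can behave very differently at high missingness.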

2.
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. In this paper, we propose a flexible rank-based nonparametric procedure for gene selection from microarray data. We propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing a different variance for each gene. The contribution of this “single gene” statistic is the studentization of the empirical AUC, which takes into account the variance associated with each gene in the experiment. DeLong et al. proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to obtain a test statistic, and we focus on the primary step in the gene selection process, namely, the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the methods, and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. We also show how to use the variance information to produce a list of significant targets and to assess differential gene expression under two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed methods offer useful analytical tools for identifying differentially expressed genes for further biological and clinical analysis.
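A minimal NumPy sketch of the core quantities (helper names are mine, not the paper's): the empirical AUC via the Mann-Whitney kernel, the DeLong placement-value variance, and a studentized ranking of genes.

```python
import numpy as np

def delong_auc(cases, controls):
    """Empirical AUC with DeLong's consistent variance estimator for a single
    marker, via placement values. Returns (auc, var)."""
    x = np.asarray(cases, float)[:, None]
    y = np.asarray(controls, float)[None, :]
    psi = (x > y) + 0.5 * (x == y)            # Mann-Whitney kernel, ties get 1/2
    auc = psi.mean()
    v10 = psi.mean(axis=1)                    # placement values of the cases
    v01 = psi.mean(axis=0)                    # placement values of the controls
    m, n = len(v10), len(v01)
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    return auc, var

def rank_genes(expr_cases, expr_controls):
    """Rank genes (columns) by the studentized statistic (AUC - 0.5)/sqrt(var),
    most significant first; perfect separation (zero variance) ranks first."""
    stats = []
    for g in range(expr_cases.shape[1]):
        auc, var = delong_auc(expr_cases[:, g], expr_controls[:, g])
        z = (auc - 0.5) / np.sqrt(var) if var > 0 else np.sign(auc - 0.5) * np.inf
        stats.append(z)
    return np.argsort(stats)[::-1]
```

The per-gene variance is exactly what makes the statistic "studentized": two genes with the same AUC can rank very differently once their sampling variability is taken into account.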

3.
The effect of nonstationarity in the time series columns of input data in principal components analysis is examined. Nonstationarity is very common among economic indicators collected over time, which are subsequently summarized into fewer indices for monitoring purposes. Due to the simultaneous drifting of the nonstationary time series, usually caused by trend, the first component averages all the variables without necessarily reducing dimensionality. Sparse principal components analysis can be used instead, but attainment of sparsity among the loadings (and hence dimension reduction) is influenced by the choice of the penalty parameters λ1,j. Simulated data with more variables than observations and with different patterns of cross-correlations and autocorrelations are used to illustrate the advantages of sparse principal components analysis over ordinary principal components analysis. Sparse component loadings for nonstationary time series data can be achieved provided that appropriate values of λ1,j are used. We provide the range of values of λ1,j that ensures convergence of the sparse principal components algorithm and consequently achieves sparsity of the component loadings.
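To make the role of the penalty concrete, here is a generic rank-one sparse PCA sketch with soft-thresholded loadings (in the spirit of penalized rank-one approximations; this is not the authors' algorithm, and `lam` stands in for λ1,j):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: variables 0 and 1 share a common component z; 2-4 are noise.
z = rng.normal(size=200)
X = 0.1 * rng.normal(size=(200, 5))
X[:, 0] += z
X[:, 1] += z

def sparse_pc1(X, lam, n_iter=100):
    """Leading sparse loading vector by rank-one alternating least squares
    with soft-thresholding of the loadings. Larger lam gives sparser loadings;
    lam=0 recovers the ordinary first principal component."""
    Xc = X - X.mean(axis=0)
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]     # warm start at PC1
    for _ in range(n_iter):
        u = Xc @ v
        u /= np.linalg.norm(u)
        v = Xc.T @ u
        v = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # soft threshold
        norm = np.linalg.norm(v)
        if norm == 0.0:                                    # lam too large
            return v
        v /= norm
    return v
```

With an appropriate penalty the noise loadings are driven exactly to zero, which is the dimension-reduction effect the abstract describes; too large a value zeroes everything, which is why the admissible range of λ1,j matters.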

4.
5.
High-dimensional datasets have exploded into many fields of research, challenging our interpretation of the classic dimension reduction technique, Principal Component Analysis (PCA). Recently proposed Sparse PCA methods offer useful insight into understanding complex data structures. This article compares three Sparse PCA methods through extensive simulations, with the aim of providing guidelines as to which method to choose under a variety of data structures, as dictated by the variance-covariance matrix. A real gene expression dataset is used to illustrate an application of Sparse PCA in practice and show how to link simulation results with real-world problems.

6.
Principal fitted component (PFC) models are a class of likelihood-based inverse regression methods that yield a so-called sufficient reduction of the random p-vector of predictors X given the response Y. Assuming that a large number of the predictors have no information about Y, we aim to obtain an estimate of the sufficient reduction that ‘purges’ these irrelevant predictors and thus selects the most useful ones. We devise a procedure using observed significance values from the univariate fittings to yield a sparse PFC, a purged estimate of the sufficient reduction. The performance of the method is compared to that of penalized forward linear regression models for variable selection in high-dimensional settings.

7.
Principal component analysis (PCA) and functional principal component analysis are key tools in multivariate analysis, in particular for modelling yield curves, but little attention is given to questions of uncertainty, either in the components themselves or in derived quantities such as scores. Actuaries using PCA to model yield curves to assess interest rate risk for insurance companies are required to show the uncertainty in their calculations. Asymptotic results based on assumptions of multivariate normality are unsatisfactory for modest samples, and application of bootstrap methods is not straightforward, with the novel pitfalls of possible inversions in the order of sample components and reversals of signs. We present methods for overcoming these difficulties and discuss other potential hazards that arise.
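The order-inversion and sign-reversal pitfalls can be handled by matching each bootstrap component to a reference fit before aggregating. A sketch of one such alignment step (my own generic greedy matcher, not necessarily the authors' procedure):

```python
import numpy as np

def align_components(V_ref, V_boot):
    """Align a bootstrap set of loading vectors (rows of V_boot) to a
    reference set (rows of V_ref): greedily reorder by absolute cross-product
    and flip signs, so components can be aggregated across resamples."""
    k = V_ref.shape[0]
    C = V_boot @ V_ref.T                 # cross-products between loading vectors
    aligned = np.empty_like(V_ref)
    used = set()
    for j in range(k):                   # match each reference component in turn
        order = np.argsort(-np.abs(C[:, j]))
        i = next(i for i in order if i not in used)
        used.add(i)
        aligned[j] = np.sign(C[i, j]) * V_boot[i]
    return aligned
```

Without this step, averaging bootstrap loadings can cancel a component against its own sign-flipped copy and badly understate (or distort) the uncertainty estimates.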

8.
ABSTRACT

Standard statistical techniques do not provide methods for analyzing data from nonreplicated factorial experiments. Such experiments occur for several reasons. Many experimenters may prefer, particularly in pilot studies, to conduct experiments with a large number of factor levels and no replication rather than experiments with a few factor levels and replication; such experiments may allow one to identify factor combinations to be used in follow-up experiments. Another possibility is that the experimenter believes an experiment is replicated when in fact it is not, as when a naive researcher treats sub-samples as replicates. Nonreplicated two-way experiments have been extensively studied. This paper discusses the analysis of nonreplicated three-way experiments. In particular, estimation of σ² is discussed, and a test is derived for testing whether three-factor interaction is absent in sub-areas of three-way data, using a nonreplicated three-way multiplicative interaction model with a single multiplicative term. The approximate null distribution of the derived test statistic is studied using Monte Carlo methods, and the results are illustrated through an example.

9.
This paper presents a study on the symmetry of repeated bi-phased data signals, in particular on quantification of the deviation between the two parts of the signal. Three symmetry scores are defined using functional data techniques such as smoothing and registration. One score is related to the L2-distance between the two parts of the signal, whereas the other two are constructed to specifically measure differences in amplitude and phase. Moreover, symmetry scores based on functional principal component analysis (PCA) are examined. The scores are applied to acceleration signals from a study on equine gait. The scores turn out to be highly associated with lameness, and their applicability for lameness quantification and detection is investigated. Four classification approaches give similar results, with the scores describing amplitude and phase variation outperforming the PCA scores in the classification of lameness.
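The L2-type score has a very simple core, sketched below on a raw discretised signal (a minimal illustration only: the paper additionally smooths and registers the curves before comparing them, and builds separate amplitude and phase scores):

```python
import numpy as np

def l2_symmetry_score(signal):
    """Root-mean-square (discretised L2) distance between the two halves
    (phases) of a repeated bi-phased signal; 0 means perfect symmetry."""
    x = np.asarray(signal, float)
    n = len(x) // 2
    a, b = x[:n], x[n:2 * n]          # the two phases of the signal
    return np.sqrt(np.mean((a - b) ** 2))
```

For gait data, a large score for one limb pair relative to a sound baseline is the kind of asymmetry signal the lameness classifiers build on.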

10.
The standard deviation of the average run length (SDARL) is an important metric for studying the performance of control charts with estimated in-control parameters; only a few studies in the literature, however, have considered this measure when evaluating control chart performance. The current study compares the in-control performance of three phase II simple linear profile monitoring approaches, namely those of Kang and Albin (2000), Kim et al. (2003), and Mahmoud et al. (2010). The comparison is performed under the assumption of estimated parameters using the SDARL metric. In general, the simulation results show that the method of Kim et al. (2003) has better overall statistical performance than the competing methods in terms of SDARL values. Some approaches recommended solely on the basis of the usual average run length properties can have poor SDARL performance.
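The SDARL idea is easiest to see on the simplest possible chart. The sketch below (my own toy setting: a Shewhart individuals chart for a standard normal process, not the paper's profile-monitoring setup) computes, for many Phase I samples, the conditional ARL given the estimated limits, then summarises those ARLs by their mean (AARL) and standard deviation (SDARL):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sdarl_shewhart(m=30, n_est=5000, L=3.0, seed=1):
    """Monte Carlo AARL and SDARL for a Shewhart individuals chart with
    estimated in-control mean and sd. For each Phase I sample of size m the
    conditional ARL is 1/p, where p is the probability that a standard normal
    observation falls outside the estimated limits mu_hat +/- L*sd_hat."""
    rng = np.random.default_rng(seed)
    arls = np.empty(n_est)
    for e in range(n_est):
        phase1 = rng.normal(size=m)
        mu, sd = phase1.mean(), phase1.std(ddof=1)
        p = norm_cdf(mu - L * sd) + 1.0 - norm_cdf(mu + L * sd)
        arls[e] = 1.0 / p
    return arls.mean(), arls.std(ddof=1)
```

A chart can have an attractive average ARL yet a huge SDARL, meaning individual practitioners' charts behave very differently from the advertised average; that is precisely the comparison criterion of the study.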

11.
Statistics, as an applied science, has a great impact on a vast range of other sciences. Prediction of protein structures, with emphasis on their geometrical features as described by dihedral angles, has given rise to a branch of statistics known as directional statistics. One of the available biological techniques for prediction is molecular dynamics simulation, which produces high-dimensional molecular structure data. It is therefore expected that principal component analysis (PCA) can address some of the related statistical problems, particularly reducing the dimension of the variables involved. Since dihedral angles are variables on a non-Euclidean space (their locus is the torus), direct application of PCA is not expected to be very informative in this case. Principal geodesic analysis is one of the recent methods for dimension reduction in the non-Euclidean case. This paper highlights a procedure for using this technique to reduce the dimension of a set of dihedral angles. We further propose an extension of this tool, implemented in such a way that the torus is approximated by the product of two unit circles, and evaluate its application on a real data set. A comparison of this technique with some previous methods is also undertaken.
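A minimal version of the "product of two unit circles" idea: embed each (φ, ψ) pair via (cos, sin) per angle, then run ordinary PCA on the embedding. This is only a sketch of the approximation mentioned in the abstract, not full principal geodesic analysis:

```python
import numpy as np

def torus_pca_scores(phi, psi, k=2):
    """Dimension reduction for dihedral-angle pairs: embed each angle on the
    unit circle as (cos, sin), so a (phi, psi) pair lives on the product of
    two circles in R^4, centre the embedding, and return the first k
    principal component scores."""
    E = np.column_stack([np.cos(phi), np.sin(phi), np.cos(psi), np.sin(psi)])
    E = E - E.mean(axis=0)
    Vt = np.linalg.svd(E, full_matrices=False)[2]
    return E @ Vt[:k].T
```

The embedding sidesteps the wrap-around problem of treating angles as plain real numbers (where 179° and -179° look far apart), at the cost of working in an extrinsic Euclidean approximation of the torus.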

12.
Most methods for survival prediction from high-dimensional genomic data combine the Cox proportional hazards model with some technique of dimension reduction, such as partial least squares regression (PLS). Applying PLS to the Cox model is not entirely straightforward, and multiple approaches have been proposed. The method of Park et al. (Bioinformatics 18(Suppl. 1):S120–S127, 2002) uses a reformulation of the Cox likelihood to a Poisson-type likelihood, thereby enabling estimation by iteratively reweighted partial least squares for generalized linear models. We propose a modification of the method of Park et al. (2002) such that estimates of the baseline hazard and the gene effects are obtained in separate steps. The resulting method has several advantages over the method of Park et al. (2002) and other existing Cox PLS approaches, as it allows for estimation of survival probabilities for new patients, enables a less memory-demanding estimation procedure, and allows for incorporation of lower-dimensional non-genomic variables like disease grade and tumor thickness. We also propose to combine our Cox PLS method with an initial gene selection step in which genes are ordered by their Cox scores and only the highest-ranking k% of the genes are retained, obtaining a so-called supervised partial least squares regression method. In simulations, both the unsupervised and the supervised version outperform other Cox PLS methods.

13.
Combining multiple biomarkers to improve diagnostic accuracy is meaningful for practitioners and clinicians and attractive to many researchers. Nowadays, with the development of modern techniques, functional markers such as curves or images play an important role in diagnosis. There is a rich literature on combination methods for continuous scalar markers. Unfortunately, only sporadic works have studied how functional markers affect diagnosis, and no publication addresses the combination of multiple functional markers to improve diagnostic accuracy. Scalar combination methods cannot be applied directly to multiple functional markers because of the infinite dimensionality of functional markers. In this article, we propose a one-dimensional scalar feature, motivated by a square loss distance, as an alternative to the original functional curve, in the sense that it retains information to the greatest extent possible. The square loss distance is defined as a function of the projection scores generated by functional principal component decomposition. A variety of existing scalar combination methods can then be applied to the scalar features of the functional markers after dimension reduction to improve diagnostic accuracy. The area under the receiver operating characteristic curve and the Youden index are used to assess the performance of various methods in numerical studies. We also analyze high and low hospital admissions due to respiratory diseases between 2010 and 2017 in Hong Kong by combining weather conditions and media information, which are regarded as functional markers. Finally, we provide an R function for convenient application.

14.
ABSTRACT

Runs rules are usually used with Shewhart-type charts to enhance the charts' sensitivity to small and moderate shifts. Abbas et al. in 2011 took this a step further by proposing two runs rules schemes, applied to the exponentially weighted moving average (EWMA) chart, and evaluated their average run length (ARL) performance using simulation. They showed that the proposed schemes are superior to the classical EWMA chart and the other schemes investigated. Besides pointing out some erroneous ARL and standard deviation of the run length (SDRL) computations in Abbas et al., this paper presents a Markov chain approach for computing the ARL, percentiles of the run length (RL) distribution, and the SDRL for the two runs rules schemes of Abbas et al. Using the Markov chain approach, we also propose two combined runs rules EWMA schemes that speed up the response of the two schemes of Abbas et al. to large shifts. The runs rules (basic and combined) EWMA schemes are compared with some existing control charting methods and shown to prevail.
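The Markov chain building block behind such ARL computations can be sketched for the classical EWMA chart without runs rules (the paper's schemes extend the state space to encode the rules; this is only the basic Brook-Evans-style discretisation, with illustrative parameter values):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ewma_arl(lam=0.2, h=0.5, shift=0.0, N=101):
    """ARL of a classical two-sided EWMA chart Z_t = (1-lam)Z_{t-1} + lam*X_t
    with limits +/-h, for X_t ~ N(shift, 1), by the Markov chain method:
    discretise (-h, h) into N states and solve (I - Q) ARL = 1."""
    w = 2.0 * h / N                           # state width
    mids = -h + (np.arange(N) + 0.5) * w      # state midpoints
    Q = np.empty((N, N))
    for i in range(N):
        drift = (1.0 - lam) * mids[i]
        for j in range(N):
            upper = (mids[j] + w / 2 - drift) / lam
            lower = (mids[j] - w / 2 - drift) / lam
            Q[i, j] = norm_cdf(upper - shift) - norm_cdf(lower - shift)
    arl = np.linalg.solve(np.eye(N) - Q, np.ones(N))
    return arl[N // 2]                        # start from the state containing 0
```

Solving the linear system gives exact-to-discretisation ARLs in milliseconds, which is why Markov chain evaluation is preferred over simulation for comparing chart designs.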

15.
In group sequential clinical trials, several sample size re-estimation methods proposed in the literature allow a change of sample size at the interim analysis. Most of these methods are based on either the conditional error function or the interim effect size. Our simulation studies compared the operating characteristics of three commonly used sample size re-estimation methods: Chen et al. (2004), Cui et al. (1999), and Muller and Schafer (2001). Gao et al. (2008) extended the CDL method and provided an analytical expression for the lower and upper thresholds of conditional power within which the type I error is preserved. Recently, Mehta and Pocock (2010) argued extensively that the real benefit of the adaptive approach is to invest the sample size resources in stages, increasing the sample size only if the interim results are in the so-called “promising zone” defined in their article. We incorporated this concept in our simulations while comparing the three methods. To test the robustness of these methods, we explored the impact of an incorrect variance assumption on the operating characteristics. We found that the operating characteristics of the three methods are very comparable. In addition, the promising zone concept, as suggested by Mehta and Pocock, gives the desired power and a smaller average sample size, and thus increases the efficiency of the trial design.
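The quantity behind the promising zone is the conditional power at the interim look. A textbook sketch for a one-sided z-test, evaluated at the interim effect estimate (the zone boundaries below are illustrative, not Mehta and Pocock's exact rule):

```python
from math import erf, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def conditional_power(z1, n1, N, z_crit=1.96):
    """Conditional power of a one-sided z-test given interim statistic z1 at
    n1 of N planned observations, evaluated at the interim effect estimate
    theta_hat = z1/sqrt(n1) (standardised, unit-variance increments)."""
    theta = z1 / sqrt(n1)
    num = z1 * sqrt(n1) + theta * (N - n1) - z_crit * sqrt(N)
    return norm_cdf(num / sqrt(N - n1))

def zone(z1, n1, N, lo=0.36, hi=0.80):
    """Classify the interim result; the (0.36, 0.80) band is illustrative.
    A sample size increase is only considered in the promising zone."""
    cp = conditional_power(z1, n1, N)
    if cp >= hi:
        return "favourable"
    return "promising" if cp >= lo else "unfavourable"
```

Restricting sample size increases to the promising zone is what lets the design invest resources only where the increase can realistically rescue power.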

16.
We propose a novel method for tensorial independent component analysis. Our approach is based on TJADE and k-JADE, two recently proposed generalizations of the classical JADE algorithm. Our method achieves the consistency and the limiting distribution of TJADE under mild assumptions while offering a notable improvement in computational speed. Detailed mathematical proofs of the statistical properties of the method are given and, as a special case, a conjecture on the properties of k-JADE is resolved. Simulations and timing comparisons demonstrate a remarkable gain in speed, and the desired efficiency is obtained approximately for finite samples. The method is applied successfully to large-scale video data, for which neither TJADE nor k-JADE is feasible. Finally, an experimental procedure is proposed to select the values of a set of tuning parameters. Supplementary material, including the R code for running the examples and the proofs of the theoretical results, is available online.

17.
Abstract

In this paper, we consider the estimation of a sensitive characteristic when the population consists of several strata, by applying Niharika et al.’s model, which uses a geometric distribution as the randomization device. The sensitive parameter is estimated for the case in which the stratum sizes are known, with proportional and optimum allocation methods taken into account. We then extend Niharika et al.’s model to the case of unknown stratum sizes, estimating the sensitive parameter by applying stratified double sampling to the model. Finally, the efficiency of the proposed model is compared with that of Niharika et al. in terms of the estimator’s variance.

18.
ABSTRACT

We propose a multiple imputation method based on principal component analysis (PCA) to deal with incomplete continuous data. To reflect the uncertainty in the parameters from one imputation to the next, we use a Bayesian treatment of the PCA model. Using a simulation study and real data sets, the method is compared to two classical approaches: multiple imputation based on joint modelling and on fully conditional modelling. Unlike the others, the proposed method can easily be used on data sets where the number of individuals is less than the number of variables and where the variables are highly correlated. In addition, it provides unbiased point estimates of quantities of interest, such as an expectation, a regression coefficient or a correlation coefficient, with a smaller mean squared error. Furthermore, the confidence intervals built for the quantities of interest are often narrower while still ensuring valid coverage.

19.
We explore the construction of new symplectic numerical integration schemes to be used in Hamiltonian Monte Carlo and study their efficiency. Integration schemes from Blanes et al. and a new scheme are considered as alternatives to the commonly used leapfrog method. All integration schemes are tested within the framework of the No-U-Turn Sampler (NUTS), both for a logistic regression model and a Student t-model. The results show that the leapfrog method is inferior to all the new methods, both in terms of the asymptotic expected acceptance probability for a model problem and the effective sample size per unit of computing time for the realistic models.
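For reference, the leapfrog baseline that the new schemes are compared against is a few lines of code. This is the standard Stormer-Verlet integrator used in HMC (generic sketch, not the paper's implementation):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Leapfrog integrator for Hamiltonian dynamics with potential U:
    half momentum step, alternating full position/momentum steps, half
    momentum step. Symplectic and time-reversible."""
    q, p = q.astype(float).copy(), p.astype(float).copy()
    p -= 0.5 * eps * grad_U(q)
    for step in range(n_steps):
        q += eps * p
        if step != n_steps - 1:
            p -= eps * grad_U(q)
    p -= 0.5 * eps * grad_U(q)
    return q, p
```

Its symplecticity keeps the Hamiltonian nearly conserved over long trajectories (hence high acceptance rates), and its exact time-reversibility is what makes the Metropolis correction in HMC and NUTS valid; the Blanes et al. schemes improve on the energy error per unit of work, not on these structural properties.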

20.
Modern technologies are frequently used to deal with new genomic problems. For instance, the STRUCTURE software is usually employed for breed assignment based on genetic information. However, standard statistical techniques offer a number of valuable tools which can be successfully applied to most such problems. In this paper, we investigate the capability of microsatellite markers for individual identification and their potential use for breed assignment of individuals in seventy Lidia breed lines and breeders. Traditional binomial logistic regression is applied to each line and used to assign an individual to a particular line. In addition, the area under the receiver operating characteristic curve (AUC) criterion is used to measure the capability of the microsatellite-based models to separate the groups. This method allows us to identify which microsatellite loci are related to each line. Overall, only one subject was misclassified, a 99.94% correct allocation rate. The minimum observed AUC was 0.986, with an average of 0.997. These results suggest that our method is competitive for animal allocation and has some interpretative advantages, as well as a strong relationship with methods based on SNPs and related techniques.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)