Similar Articles
20 similar articles found.
1.
Differential analysis techniques are commonly used to offer scientists a dimension-reduction procedure and an interpretable gateway to variable selection, especially when confronting high-dimensional genomic data. Huang et al. used a gene expression profile of breast cancer cell lines to identify genomic markers that are highly correlated with in vitro sensitivity to the drug Dasatinib. They considered three statistical methods to identify differentially expressed genes and ultimately used the intersection of the results. However, the statistical methods used in that paper are not sufficient for selecting the genomic markers. In this paper we used three alternative statistical methods to select a combined list of genomic markers and compared it with the genes proposed by Huang et al. We then proposed using sparse principal component analysis (sparse PCA) to identify a final list of genomic markers. Sparse PCA takes the correlation among the genes into account and facilitates successful genomic marker discovery. We present a new, small set of genomic markers that effectively separates out the groups of patients who are sensitive to the drug Dasatinib. The analysis procedure should also encourage scientists to identify genomic markers that can help separate two groups.
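As a rough illustration of the sparse-PCA step, the sketch below keeps genes whose loadings are nonzero on any sparse component; the data, gene names, and the sparsity level `alpha` are hypothetical stand-ins, not the authors' breast cancer data.

```python
# Sparse PCA marker selection on a hypothetical expression matrix
# (cell lines x genes); alpha controls how many loadings are driven to zero.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))            # 40 cell lines, 500 candidate genes
gene_names = [f"gene_{j}" for j in range(X.shape[1])]

spca = SparsePCA(n_components=2, alpha=2.0, random_state=0).fit(X)

# Genes with a nonzero loading on any sparse component form the marker list.
keep = np.any(spca.components_ != 0, axis=0)
markers = [g for g, k in zip(gene_names, keep) if k]
print(len(markers), markers[:10])
```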

2.
Variable selection methods have been widely used in the analysis of high-dimensional data, for example, gene expression microarray data and single nucleotide polymorphism data. A special feature of genomic data is that genes participating in a common metabolic pathway or sharing a similar biological function tend to be highly correlated. The collinearity naturally embedded in these data requires special handling, which existing variable selection methods cannot provide. In this paper, we propose a set of new methods to select variables in correlated data. The new methods follow the forward selection procedure of least angle regression (LARS) but conduct grouping and selection at the same time. They are especially useful when no prior information on the group structure of the data is available. Simulations and real examples show that our proposed methods often outperform existing variable selection methods, including LARS and the elastic net, in terms of both reducing prediction error and preserving sparsity of representation.
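The grouping-LARS methods themselves are not in standard libraries, but the baseline comparison they improve on can be sketched with scikit-learn: simulate one correlated group of signal predictors and inspect which variables plain LARS and the elastic net retain. All sizes and seeds are illustrative.

```python
# Correlated-group simulation: five signal predictors share a latent factor.
# Compare which variables LARS and the elastic net keep.
import numpy as np
from sklearn.linear_model import Lars, ElasticNetCV

rng = np.random.default_rng(1)
n, p = 100, 50
z = rng.normal(size=(n, 1))                              # shared latent factor
X = np.hstack([z + 0.1 * rng.normal(size=(n, 5)),        # correlated group
               rng.normal(size=(n, p - 5))])             # independent noise
y = X[:, :5].sum(axis=1) + rng.normal(size=n)

lars = Lars(n_nonzero_coefs=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print("LARS support:", np.flatnonzero(lars.coef_))
print("ENet support:", np.flatnonzero(np.abs(enet.coef_) > 1e-8))
```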

3.
In many complex diseases such as cancer, a patient passes through various disease stages before reaching a terminal state (say, disease-free or death). This fits a multistate model framework, where a prognosis may be equivalent to predicting state occupation at a future time t. With the advent of high-throughput genomic and proteomic assays, a clinician may intend to use such high-dimensional covariates to make better predictions of state occupation. In this article, we offer a practical solution to this problem by combining a useful technique, called pseudo-value (PV) regression, with a latent-factor or penalized regression method such as partial least squares (PLS) or the least absolute shrinkage and selection operator (LASSO), or their variants. We explore the predictive performance of these combinations in various high-dimensional settings via extensive simulation studies. Overall, this strategy works fairly well provided the models are tuned properly, and PLS turns out to be slightly better than LASSO in most settings we investigated for the purpose of temporal prediction of future state occupation. We illustrate the utility of these PV-based high-dimensional regression methods on a lung cancer data set using the patients' baseline gene expression values.
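A minimal sketch of the PV+LASSO combination for a single state ("alive") at one fixed time: jackknife pseudo-values are built from Kaplan-Meier fits (using lifelines) and then regressed on high-dimensional covariates with a cross-validated LASSO. The simulated data, the time point `t0`, and all sizes are placeholders.

```python
# Jackknife pseudo-values for the "alive" state at time t0, built from
# Kaplan-Meier fits, then a cross-validated LASSO on the pseudo-values.
import numpy as np
from lifelines import KaplanMeierFitter
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p, t0 = 120, 200, 2.0
X = rng.normal(size=(n, p))                       # baseline gene expressions
time = rng.exponential(scale=np.exp(0.5 * X[:, 0]))
event = rng.uniform(size=n) < 0.8                 # ~20% censoring

def km_at(t_arr, e_arr, t):
    return float(KaplanMeierFitter().fit(t_arr, e_arr).predict(t))

s_full = km_at(time, event, t0)
idx = np.arange(n)
# PV for subject i: n * S_hat(t0) - (n - 1) * S_hat^(-i)(t0)
pv = np.array([n * s_full - (n - 1) * km_at(time[idx != i], event[idx != i], t0)
               for i in range(n)])

lasso = LassoCV(cv=5).fit(X, pv)                  # PV regression with LASSO
print("selected genes:", np.flatnonzero(lasso.coef_))
```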

4.
In clinical studies, researchers measure patients' responses longitudinally. In recent studies, mixed models have been used to determine effects at the individual level. Henderson et al. [3,4], on the other hand, developed a joint likelihood function that combines the likelihood functions of longitudinal biomarkers and survival times. They placed random effects in the longitudinal component to determine whether a longitudinal biomarker is associated with time to an event. In this paper, we treat a longitudinal biomarker as a growth curve and extend Henderson's method to determine whether a longitudinal biomarker is associated with time to an event for multivariate survival data.

5.
This article explores an 'Edge Selection' procedure to fit an undirected graph to a given data set. Undirected graphs are routinely used to represent, model and analyse associative relationships among the entities of a social, biological or genetic network. Our proposed method combines the computational efficiency of least angle regression with a guarantee of symmetry of the selected adjacency matrix. Various local and global properties of the edge selection path are explored analytically. In particular, a suitable parameter that controls the amount of shrinkage is identified, and we consider several cross-validation techniques to choose an accurate predictive model on the path. The proposed method is illustrated with a detailed simulation study involving models with various levels of sparsity and variability in the nodal degree distributions. Finally, our method is used to select undirected graphs from several real data sets. We employ it to identify the regulatory network of isoprenoid pathways from gene-expression data and to identify a genetic network from a high-dimensional breast cancer study.
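The authors' LARS-based edge-selection path is not available off the shelf; as a related penalized approach to the same task of estimating a symmetric adjacency structure, this sketch uses the graphical lasso from scikit-learn on simulated data with one planted edge.

```python
# Graphical-lasso sketch: estimate a sparse precision matrix and read off a
# symmetric adjacency. One true edge (nodes 0-1) is planted in the precision.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(3)
d = 10
P = np.eye(d)
P[0, 1] = P[1, 0] = 0.4                   # true conditional dependence
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(P), size=200)

model = GraphicalLassoCV().fit(X)         # cross-validated shrinkage amount
off_diag = ~np.eye(d, dtype=bool)
adjacency = (np.abs(model.precision_) > 1e-6) & off_diag
print("selected edges:", np.argwhere(np.triu(adjacency)))
```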

6.
This paper addresses simultaneous variable selection and estimation in the random-intercepts model with a first-order lagged response. This type of model is commonly used for analyzing longitudinal data obtained through repeated measurements on individuals over time. The model uses random effects to capture the intra-class correlation and the first lagged response to address the serial correlation, the two common sources of dependency in longitudinal data. We demonstrate that a conditional likelihood approach that ignores the correlation between the random effects and the initial responses can lead to biased regularized estimates. Furthermore, we show that jointly modeling the initial responses and the subsequent observations within the dynamic random-intercepts model yields both consistency and oracle properties of the regularized estimators. We present theoretical results in both low- and high-dimensional settings and evaluate the regularized estimators' performance through simulation studies and the analysis of a real dataset. Supporting information is available online.
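For orientation, the sketch below fits exactly the naive conditional model the paper cautions about: a random intercept plus a lagged-response covariate, via statsmodels MixedLM on simulated panel data. The joint model of initial and subsequent responses that the paper recommends is not shown here.

```python
# Naive conditional fit for a random-intercepts model with a first-order
# lagged response, on simulated panel data. The paper shows this shortcut
# (conditioning on the initial response) can bias regularized estimates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
rows = []
for i in range(100):                                # 100 individuals
    b = rng.normal()                                # random intercept
    y_prev = rng.normal()                           # initial response y_0
    for t in range(1, 6):                           # 5 follow-up occasions
        x = rng.normal()
        y = 0.5 * y_prev + 1.0 * x + b + rng.normal()
        rows.append(dict(id=i, y=y, y_lag=y_prev, x=x))
        y_prev = y

df = pd.DataFrame(rows)
fit = smf.mixedlm("y ~ y_lag + x", df, groups=df["id"]).fit()
print(fit.params)                                   # compare with truth (0.5, 1.0)
```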

7.
Sun Yifan et al., Statistical Research (统计研究), 2021, 38(5): 136-146
With the development of information technology, high-dimensional data have become increasingly abundant. In practice, many high-dimensional data sets are formed by merging multiple data sets from different sources. Accurately identifying the similarities and differences across such data sets has become one of the goals of big-data analysis. This paper proposes an integrative analysis method for high-dimensional data under varying-coefficient models. The method performs variable selection and coefficient estimation on multiple data sets simultaneously, and automatically identifies whether each variable's coefficients agree or differ across data sets. Simulation results show that the proposed method clearly outperforms competing methods in identifying similarities and differences, in variable selection, in coefficient estimation, and in prediction. In an application to identifying lung cancer susceptibility genes, the method identifies biologically interpretable susceptibility genes and uncovers similarities and differences between two subtypes.

8.
In panel data analysis, predictors may affect the response in substantially different ways. Some predictors have homogeneous effects across all individuals, while others act heterogeneously. Effectively differentiating these two kinds of predictors is crucial, particularly in high-dimensional panel data, since the homogeneity assumption greatly reduces the number of parameters and hence improves interpretability. In this article, based on a hierarchical Bayesian panel regression model, we propose a novel yet effective Markov chain Monte Carlo (MCMC) algorithm, together with a simple maximum ratio criterion, to detect the predictors with homogeneous effects in high-dimensional panel data. Extensive Monte Carlo simulations show that this MCMC algorithm performs well. The usefulness of the proposed method is further demonstrated with a real example from the Chinese financial market.

9.
In biomedical research, profiling is now commonly conducted, generating high-dimensional genomic measurements (without loss of generality, say genes). An important analysis objective is to rank genes according to their marginal associations with a disease outcome/phenotype. Clinical covariates, including, for example, clinical risk factors and environmental exposures, usually exist and need to be properly accounted for. In this study, we propose conducting marginal ranking of genes using a receiver operating characteristic (ROC) based method. This method accommodates categorical, censored survival, and continuous outcome variables in a very similar manner. Unlike logistic-model-based methods, it does not make specific model assumptions, which makes it robust. In ranking genes, we account for both the main effects of clinical covariates and their interactions with genes, and we develop multiple measurements of diagnostic accuracy improvement. Using simulation studies, we show that the proposed method is effective in that genes, or gene-covariate interactions, associated with the outcome receive high rankings. In data analysis, we observe some differences between the rankings produced by the proposed method and the logistic-model-based method.
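A bare-bones sketch of model-free marginal ranking by ROC AUC for a binary outcome; the paper's covariate adjustment, gene-covariate interactions, and censored-outcome handling are omitted, and the simulated data are purely illustrative.

```python
# Rank genes by marginal AUC against a binary outcome; gene 0 is the signal.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n, p = 300, 100
G = rng.normal(size=(n, p))                          # gene expressions
y = (G[:, 0] + rng.normal(size=n) > 0).astype(int)   # gene 0 drives the outcome

auc = np.array([roc_auc_score(y, G[:, j]) for j in range(p)])
auc = np.maximum(auc, 1.0 - auc)                     # direction-free ranking
print("top-ranked genes:", np.argsort(-auc)[:5])     # gene 0 should rank first
```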

10.
High-throughput profiling is now common in biomedical research. In this paper we consider the layout of an etiology study composed of a failure-time response and gene expression measurements. In current practice, a widely adopted approach is to select genes by a preliminary marginal screening followed by penalized regression for model building. Confounders, including, for example, clinical risk factors and environmental exposures, usually exist and need to be properly accounted for. We propose covariate-adjusted screening and variable selection procedures under the accelerated failure time model. While penalizing the high-dimensional coefficients to achieve parsimonious model forms, our procedures also properly adjust for the low-dimensional confounder effects to achieve more accurate estimation of the regression coefficients. We establish the asymptotic properties of the proposed methods and carry out simulation studies to assess their finite-sample performance. Our methods are illustrated with a real gene expression data analysis in which proper adjustment for confounders produces more meaningful results.

11.
In this paper, we consider the joint modelling of survival and longitudinal data with informative observation time points. The survival model and the longitudinal model are linked via random effects, for which no distribution assumption is required under our estimation approach. The estimator is shown to be consistent and asymptotically normal. The proposed estimator and its estimated covariance matrix can be easily calculated. Simulation studies and an application to a primary biliary cirrhosis study are also provided.

12.
In this article we present a technique for implementing large-scale optimal portfolio selection. We use high-frequency daily data to capture valuable statistical information in asset returns. We describe several statistical issues involved in quantitative approaches to portfolio selection. Our methodology applies to large-scale portfolio-selection problems in which the number of possible holdings is large relative to the estimation period provided by historical data. We illustrate our approach on an equity database that consists of stocks from the Standard and Poor's index, and we compare our portfolios to this benchmark index. Our methodology differs from the usual quadratic programming approach to portfolio selection in three ways: (1) We employ informative priors on the expected returns and variance-covariance matrices, (2) we use daily data for estimation purposes, with upper and lower holding limits for individual securities, and (3) we use a dynamic asset-allocation approach that is based on reestimating and then rebalancing the portfolio weights on a prespecified time window. The key inputs to the optimization process are the predictive distributions of expected returns and the predictive variance-covariance matrix. We describe the statistical issues involved in modeling these inputs for high-dimensional portfolio problems in which our data frequency is daily. In our application, we find that our optimal portfolio outperforms the underlying benchmark.
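A compact sketch of the rebalancing loop under holding limits: on each estimation window, compute sample moments of the daily returns and solve a mean-variance program with SciPy. Plain sample estimates stand in for the informative priors and predictive distributions the article actually uses, and the risk-aversion value is arbitrary.

```python
# Rolling mean-variance rebalancing with per-asset holding limits.
# Sample moments replace the article's Bayesian predictive inputs.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
T, n_assets, window = 500, 8, 250
returns = rng.normal(5e-4, 0.01, size=(T, n_assets))    # simulated daily returns

def rebalance(R, risk_aversion=5.0, lo=0.0, hi=0.3):
    mu, Sigma = R.mean(axis=0), np.cov(R, rowvar=False)
    obj = lambda w: -(w @ mu - 0.5 * risk_aversion * w @ Sigma @ w)
    budget = {"type": "eq", "fun": lambda w: w.sum() - 1.0}
    res = minimize(obj, np.full(n_assets, 1.0 / n_assets),
                   bounds=[(lo, hi)] * n_assets, constraints=[budget])
    return res.x

for start in range(0, T - window, 50):                  # rebalance every 50 days
    weights = rebalance(returns[start:start + window])
print("final weights:", np.round(weights, 3))
```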

13.
Many different models for the analysis of high-dimensional survival data have been developed over the past years. While some of the models and implementations come with an internal parameter tuning automatism, others require the user to accurately adjust defaults, which often feels like a guessing game. Exhaustively trying out all model and parameter combinations will quickly become tedious or infeasible in computationally intensive settings, even if parallelization is employed. Therefore, we propose to use modern algorithm configuration techniques, e.g. iterated F-racing, to efficiently move through the model hypothesis space and to simultaneously configure algorithm classes and their respective hyperparameters. In our application we study four lung cancer microarray data sets. For these we configure a predictor based on five survival analysis algorithms in combination with eight feature selection filters. We parallelize the optimization and all comparison experiments with the BatchJobs and BatchExperiments R packages.

14.
The problem of interpreting lung-function measurements in industrial workers is examined. The data under discussion pertain to FEV1 and FVC measurements in smoking and nonsmoking groups of grain-elevator workers in British Columbia and of workers in Vancouver City Hall. Initial observations have now been enriched by longitudinal follow-up data on the same groups after three and after six years. It is shown that interesting selection phenomena, favouring "fit" individuals, take place over time, with regard both to lung symptoms and to lung functions. Thus cross-sectional and longitudinal studies refer to somewhat different populations. It also appears that longitudinal studies are considerably more sensitive in identifying cumulative lung damage than are corresponding cross-sectional studies. The nonlinearity of the effect of age on lung functions is noted in the longitudinal data in a number of cases, lending support to the hypothesis of an association between a quadratic age effect and cumulative exposure to lung insults.
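A quadratic age effect of the kind noted here can be checked with a simple OLS term for age squared; the sketch below uses simulated FEV1-like data, and all variable names are hypothetical.

```python
# OLS check for curvature in the age effect on a lung-function measure.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
age = rng.uniform(20, 65, size=400)
fev1 = 5.0 - 0.01 * age - 0.0005 * age**2 + rng.normal(scale=0.3, size=400)
df = pd.DataFrame({"age": age, "fev1": fev1})

fit = smf.ols("fev1 ~ age + I(age**2)", data=df).fit()
print(fit.summary().tables[1])   # a significant I(age**2) row flags curvature
```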

15.
This work is motivated by a quantitative Magnetic Resonance Imaging study of the differential tumor/healthy tissue change in contrast uptake induced by radiation. The goal is to determine the time in which there is maximal contrast uptake (a surrogate for permeability) in the tumor relative to healthy tissue. A notable feature of the data is its spatial heterogeneity. Zhang, Johnson, Little, and Cao (2008a and 2008b) discuss two parallel approaches to "denoise" a single image of change in contrast uptake from baseline to one follow-up visit of interest. In this work we extend the image model to explore the longitudinal profile of the tumor/healthy tissue contrast uptake in multiple images over time. We fit a two-stage model. First, we propose a longitudinal image model for each subject. This model simultaneously accounts for the spatial and temporal correlation and denoises the observed images by borrowing strength both across neighboring pixels and over time. We propose to use the Mann-Whitney U statistic to summarize the tumor contrast uptake relative to healthy tissue. In the second stage, we fit a population model to the U statistic and estimate when it achieves its maximum. Our initial findings suggest that the maximal contrast uptake of the tumor core relative to healthy tissue peaks around three weeks after initiation of radiotherapy, though this warrants further investigation.
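The second-stage summary can be sketched directly with SciPy: at each imaging time, compare tumor against healthy-tissue pixel values with the Mann-Whitney U statistic and locate the time of the maximum. The first-stage spatial denoising model is omitted, and the simulated peak at week three merely mirrors the reported finding.

```python
# Mann-Whitney U summary of tumor vs healthy pixel uptake at each time point.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(8)
weeks = np.arange(1, 7)
u_stats = []
for t in weeks:
    shift = np.exp(-0.5 * (t - 3.0) ** 2)            # uptake peaking near week 3
    tumor = rng.normal(loc=shift, size=200)          # tumor-core pixels
    healthy = rng.normal(loc=0.0, size=200)          # healthy-tissue pixels
    u_stats.append(mannwhitneyu(tumor, healthy).statistic)

print("peak at week:", weeks[int(np.argmax(u_stats))])
```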

16.
Growth curve analysis is beneficial in longitudinal studies, where the pattern of response variables measured repeatedly over time is of interest yet unknown. In this article, we propose generalized growth curve models under a polynomial regression framework and offer a complete process that identifies parsimonious growth curves for different groups of interest, as well as compares the curves. A higher polynomial degree generally provides more flexible regression, yet it may yield an overly complicated and overfitted model in practice. Therefore, we employ a model selection procedure that chooses the optimal degree of the polynomial consistently. We consider the quadratic inference function (Qu et al., 2000) for estimating the regression parameters, and estimation efficiency is improved by incorporating the within-subject correlation commonly present in longitudinal data. In biomedical studies, it is of particular interest to compare multiple treatments and identify an effective one. We further conduct a hypothesis test that assesses the equality of the growth curves through an asymptotic chi-square test statistic. The proposed methodology is applied to a randomized controlled longitudinal dataset on depression. The effectiveness of our procedure is also confirmed with simulation studies.
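A minimal sketch of the degree-selection step, using BIC on pooled observations while ignoring within-subject correlation; the article's quadratic inference function machinery refines both the estimation and the selection well beyond this.

```python
# Choose a polynomial degree for a growth curve by BIC on simulated data.
import numpy as np

rng = np.random.default_rng(9)
t = np.tile(np.arange(6.0), 50)                       # 50 subjects, 6 visits
y = 1.0 + 0.8 * t - 0.1 * t**2 + rng.normal(scale=0.5, size=t.size)

def bic(degree):
    resid = y - np.polyval(np.polyfit(t, y, degree), t)
    n = y.size
    return n * np.log(resid @ resid / n) + (degree + 1) * np.log(n)

best = min(range(1, 5), key=bic)
print("BIC-selected degree:", best)                   # expect 2 here
```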

17.
In this paper, we introduce the empirical likelihood (EL) method to longitudinal studies. By considering the dependence within subjects in the auxiliary random vectors, we propose a new weighted empirical likelihood (WEL) inference for generalized linear models with longitudinal data. We show that the weighted empirical likelihood ratio always follows an asymptotically standard chi-squared distribution no matter which working weight matrix is chosen, but a well-chosen working weight matrix can improve the efficiency of statistical inference. Simulations are conducted to demonstrate the accuracy and efficiency of the proposed WEL method, and a real data set is used to illustrate it.
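For intuition, here is the simplest unweighted empirical likelihood ratio for a scalar mean in the i.i.d. case (Owen's construction); the paper's weighted EL for longitudinal GLMs profiles an analogous ratio with a working weight matrix, which is not shown here.

```python
# Unweighted empirical likelihood ratio test for a scalar mean.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def neg2_log_el_ratio(x, mu):
    d = x - mu
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                        # mu outside the convex hull
    g = lambda lam: np.mean(d / (1.0 + lam * d))
    lo = -1.0 / d.max() * (1 - 1e-9)         # keep all weights positive
    hi = -1.0 / d.min() * (1 - 1e-9)
    lam = brentq(g, lo, hi)                  # profile the Lagrange multiplier
    return 2.0 * np.sum(np.log1p(lam * d))   # asymptotically chi-squared(1)

x = np.random.default_rng(10).normal(loc=1.0, size=50)
stat = neg2_log_el_ratio(x, mu=1.0)
print("-2 log R =", stat, " p =", 1.0 - chi2.cdf(stat, df=1))
```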

18.
In this paper, we propose a multivariate growth curve mixture model that groups subjects based on multiple symptoms measured repeatedly over time. Our model synthesizes features of two models. First, we follow Roy and Lin (2000) in relating the multiple symptoms at each time point to a single latent variable. Second, we use the growth mixture model of Muthén and Shedden (1999) to group subjects based on distinctive longitudinal profiles of this latent variable. The mean growth curve for the latent variable in each class defines that class's features. For example, a class of "responders" would have a decline in the latent symptom summary variable over time. A Bayesian approach to estimation is employed where the methods of Elliott et al. (2005) are extended to simultaneously estimate the posterior distributions of the parameters from the latent variable and growth curve mixture portions of the model. We apply our model to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin (BCG) in treating symptoms of Interstitial Cystitis. In contrast to conventional approaches using a single subjective Global Response Assessment, we use the multivariate symptom data to identify a class of subjects where treatment demonstrates effectiveness. Simulations are used to confirm identifiability results and evaluate the performance of our algorithm. The definitive version of this paper is available at onlinelibrary.wiley.com.

19.
In recent years, the joint analysis of longitudinal measurements and survival data has received much attention. However, previous work has primarily focused on a single failure type for the event time. In this article, we consider the joint modeling of repeated measurements and competing-risks failure time data, allowing more than one distinct failure type in the survival endpoint: we fit a cause-specific hazards sub-model for the competing risks, with a separate latent association between the longitudinal measurements and each cause of failure. Moreover, previous work has not addressed testing these separate latent associations. In this article, we derive a score test to identify longitudinal biomarkers or surrogates for a time-to-event outcome in competing-risks data. With a carefully chosen definition of complete data, maximum likelihood estimation of the cause-specific hazard functions is performed via an EM algorithm. We extend this work by allowing random effects to be present in both the longitudinal biomarker and the underlying survival function. The random effects in the biomarker are introduced via an explicit term, while the random effect in the underlying survival function is introduced by including a frailty in the model.

We use simulations to explore how the number of individuals, the number of time points per individual, and the functional form of the random effects from the longitudinal biomarkers, allowing for heterogeneous baseline hazards across individuals, influence the power to detect the association between a longitudinal biomarker and the survival time.


20.
Time-course gene sets are collections of predefined groups of genes measured in patients repeatedly over time. Testing which gene sets vary significantly over time is an important task in genomic data analysis. In this paper, the method of generalized estimating equations (GEEs), a semi-parametric approach, is applied to time-course gene set data. We propose a special structure for the working correlation matrix to handle the association among the repeated measurements of each patient over time. The proposed working correlation matrix also permits estimation of the effects of the same gene across different patients. The proposed approach is applied to an HIV therapeutic vaccine trial (the DALIA-1 trial). This data set has two phases, pre-ATI and post-ATI, defined relative to a vaccination period. Using multiple testing, the significant gene sets in the pre-ATI phase are detected, and data on two randomly selected gene sets in the post-ATI phase are also analyzed. Simulation studies confirm the good performance of the proposed approach.
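A working-correlation GEE fit of the kind described can be sketched with statsmodels; an exchangeable structure stands in here for the paper's specially structured working correlation matrix, which is not available off the shelf, and the data are simulated.

```python
# GEE for repeated gene-expression summaries per patient, with an
# exchangeable working correlation as a stand-in for the paper's structure.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
n_pat, n_t = 30, 5
df = pd.DataFrame({
    "patient": np.repeat(np.arange(n_pat), n_t),
    "time": np.tile(np.arange(n_t, dtype=float), n_pat),
})
subj = np.repeat(rng.normal(size=n_pat), n_t)     # within-patient dependence
df["expr"] = 0.3 * df["time"] + subj + rng.normal(size=len(df))

model = sm.GEE.from_formula("expr ~ time", groups="patient", data=df,
                            cov_struct=sm.cov_struct.Exchangeable(),
                            family=sm.families.Gaussian())
print(model.fit().summary())   # the time coefficient tests change over time
```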

