Similar literature
Found 20 similar articles (search time: 343 ms)
1.
Gene regulatory networks are collections of genes that interact with one another and with other substances in the cell. By measuring gene expression over time using high-throughput technologies, it may be possible to reverse engineer, or infer, the structure of the gene network involved in a particular cellular process. These gene expression data typically have a high dimensionality and a limited number of biological replicates and time points. Due to these issues and the complexity of biological systems, the problem of reverse engineering networks from gene expression data demands a specialized suite of statistical tools and methodologies. We propose a non-standard adaptation of a simulation-based approach known as Approximate Bayesian Computing based on Markov chain Monte Carlo sampling. This approach is particularly well suited for the inference of gene regulatory networks from longitudinal data. The performance of this approach is investigated via simulations and using longitudinal expression data from a genetic repair system in Escherichia coli.
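
Illustrative sketch (not taken from the abstract): the generic ABC-MCMC idea can be shown on a toy problem. The example below infers the mean of a normal model and is not the authors' non-standard adaptation for gene networks. The function names, the flat prior bounds, the tolerance `eps`, and the random-walk step size are all assumptions made for illustration.

```python
import random


def simulate_mean(theta, n, rng):
    """Simulate n draws from N(theta, 1) and return their sample mean (the summary statistic)."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n


def abc_mcmc(s_obs, n, n_iter=5000, eps=0.2, step=0.5, bounds=(-10.0, 10.0), seed=0):
    """Toy ABC-MCMC for theta in y ~ N(theta, 1) with a flat prior on `bounds` (assumed)."""
    rng = random.Random(seed)
    theta = s_obs  # start at a value that is plausible given the observed summary
    chain = []
    for _ in range(n_iter):
        proposal = rng.gauss(theta, step)  # symmetric random-walk proposal
        # Flat prior: the prior ratio is 1 inside the bounds and 0 outside.
        if bounds[0] <= proposal <= bounds[1]:
            s_sim = simulate_mean(proposal, n, rng)
            # Likelihood-free step: accept only if simulated data resemble the observed data.
            if abs(s_sim - s_obs) <= eps:
                theta = proposal
        chain.append(theta)
    return chain
```

Calling `abc_mcmc(s_obs=1.5, n=20)` returns a list of sampled values of `theta`; no likelihood is evaluated at any point, only forward simulation from the model.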

2.
The most common assumption in geostatistical modeling of malaria is stationarity, that is, spatial correlation is a function of the separation vector between locations. However, local factors (environmental or human-related activities) may influence geographical dependence in malaria transmission differently at different locations, introducing non-stationarity. Ignoring this characteristic in malaria spatial modeling may lead to inaccurate estimates of the standard errors for both the covariate effects and the predictions. In this paper, a model based on random Voronoi tessellation that takes into account non-stationarity was developed. In particular, the spatial domain was partitioned into sub-regions (tiles), a stationary spatial process was assumed within each tile and between-tile correlation was taken into account. The number and configuration of the sub-regions are treated as random parameters in the model and inference is made using reversible jump Markov chain Monte Carlo simulation. This methodology was applied to analyze malaria survey data from Mali and to produce a country-level smooth map of malaria risk.
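
Illustrative sketch (not taken from the abstract): a Voronoi tessellation assigns each location to the tile of its nearest center. The snippet below shows only that assignment step for a fixed set of centers. In the model described above the number and positions of the centers are random and sampled by reversible jump MCMC, which is not shown here. The function name and the example coordinates are hypothetical.

```python
import math


def assign_tiles(points, centers):
    """Return, for each point, the index of its nearest center (its Voronoi tile)."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]


# Hypothetical survey locations and tile centers (planar coordinates).
centers = [(0.0, 0.0), (10.0, 0.0)]
locations = [(1.0, 1.0), (9.0, -1.0), (4.0, 0.0)]
tiles = assign_tiles(locations, centers)
```

Here `tiles` holds one tile index per location; a stationary process would then be assumed within each tile.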

3.
4.
5.
Canine hip dysplasia (CHD) is characterized by hip laxity and subluxation that can lead to hip osteoarthritis. Studies have shown the involvement of multiple genetic regions in the expression of CHD. Although we have associated some variants in the region of fibrillin 2 with CHD in a subset of dogs, no major disease-associated gene has been identified. The focus of this study is to identify quantitative trait loci (QTL) associated with CHD. Two sequential multipoint linkage analyses based on a reversible jump Markov chain Monte Carlo approach were applied to a cross-breed pedigree of 366 dogs. A hip radiographic trait (Norberg angle, NA) on both hips of each dog was tested for linkage to 21,455 single nucleotide polymorphisms across 39 chromosomes. Putative QTL for the NA were found on 11 chromosomes (1, 2, 3, 4, 7, 14, 19, 21, 32, 36, and 39). Identification of genes in the QTL region(s) can assist in identification of the aberrant genes and biochemical pathways involving hip dysplasia in both dogs and humans.

6.
In the field of molecular biology, it is often of interest to analyze microarray data for clustering genes based on similar profiles of gene expression to identify genes that are differentially expressed under multiple biological conditions. One of the notable characteristics of a gene expression profile is that it shows a cyclic curve over a course of time. To group sequences of similar molecular functions, we propose a Bayesian Dirichlet process mixture of linear regression models with a Fourier series for the regression coefficients, for each of which a spike and slab prior is assumed. A full Gibbs-sampling algorithm is developed for an efficient Markov chain Monte Carlo (MCMC) posterior computation. Due to the so-called “label-switching” problem and different numbers of clusters during the MCMC computation, a post-process approach of Fritsch and Ickstadt (2009) is additionally applied to MCMC samples for an optimal single clustering estimate by maximizing the posterior expected adjusted Rand index with the posterior probabilities of two observations being clustered together. The proposed method is illustrated with two simulated data sets and one real data set on the physiological response of fibroblasts to serum from Iyer et al. (1999).

7.
Many applications of statistical methods for data that are spatially correlated require the researcher to specify the correlation structure of the data. This can be a difficult task as there are many candidate structures. Some spatial correlation structures depend on the distance between the observed data points while others rely on neighborhood structures. In this paper, Bayesian methods that systematically determine the ‘best’ correlation structure from a predefined class of structures are proposed. Bayes factors, Highest Probability Models, and Bayesian Model Averaging are employed to determine the ‘best’ correlation structure and to average across these structures to create a non-parametric alternative structure for a loblolly pine data set with known tree coordinates. Tree diameters and heights were measured and an investigation into the spatial dependence between the trees was conducted. Results showed that the most probable model for the spatial correlation structure agreed with allometric trends for loblolly pine. A combined Matern, simultaneous autoregressive model and conditional autoregressive model best described the inter-tree competition among the loblolly pine tree data considered in this research.

8.
We consider the problem of statistically evaluating the similarity of DNA intronic regions of genes. Present algorithms are based on matching a sequence of interest with known DNA sequences in a gene bank and are designed primarily to assess homology among exonic regions of genes. Most research focuses on exonic regions because they have a clear biological significance, coding for proteins, and therefore tend to be more conserved in evolution than intronic regions. To investigate whether the intronic features of genes whose expression is highly sensitive to environmental perturbations differ from genes that have a more constant expression, a collection of oncogenes, tumor suppressor genes, and nonregulatory genes involved in energy metabolism are compared. An analysis of the features of these genes' intronic regions results in clustering by regulatory group. In addition, Billingsley's test for Markov structure (1961) suggests that 67% of the intronic regions in this collection of genes show evidence of nonrandom structure, indicating the possibility of a biological function for these regions. The result of Billingsley's test for homology is used as input to a clustering algorithm. The biological significance of this methodology lies in the identification of groups based on the intronic regions from genes of unknown function. With the advent of rapid sequencing techniques, there is a great need for statistical techniques to help identify the purpose of poorly understood portions of genes. These methods can be utilized to assess the functional group to which such a gene might possibly belong.

9.
The estimation of Bayesian networks given high‐dimensional data, in particular gene expression data, has been the focus of much recent research. Whilst there are several methods available for the estimation of such networks, these typically assume that the data consist of independent and identically distributed samples. It is often the case, however, that the available data have a more complex mean structure, plus additional components of variance, which must then be accounted for in the estimation of a Bayesian network. In this paper, score metrics that take account of such complexities are proposed for use in conjunction with score‐based methods for the estimation of Bayesian networks. We propose first, a fully Bayesian score metric, and second, a metric inspired by the notion of restricted maximum likelihood. We demonstrate the performance of these new metrics for the estimation of Bayesian networks using simulated data with known complex mean structures. We then present the analysis of expression levels of grape‐berry genes adjusting for exogenous variables believed to affect the expression levels of the genes. Demonstrable biological effects can be inferred from the estimated conditional independence relationships and correlations amongst the grape‐berry genes.

10.
Summary. In geostatistics it is common practice to assume that the underlying spatial process is stationary and isotropic, i.e. the spatial distribution is unchanged when the origin of the index set is translated and under rotation about the origin. However, in environmental problems, such assumptions are not realistic since local influences in the correlation structure of the spatial process may be found in the data. The paper proposes a Bayesian model to address the anisotropy problem. Following Sampson and Guttorp, we define the correlation function of the spatial process by reference to a latent space, denoted by D, where stationarity and isotropy hold. The space where the gauged monitoring sites lie is denoted by G. We adopt a Bayesian approach in which the mapping between G and D is represented by an unknown function d(·). A Gaussian process prior distribution is defined for d(·). Unlike the Sampson–Guttorp approach, the mapping of both gauged and ungauged sites is handled in a single framework, and predictive inferences take explicit account of uncertainty in the mapping. Markov chain Monte Carlo methods are used to obtain samples from the posterior distributions. Two examples are discussed: a simulated data set and the solar radiation data set that also was analysed by Sampson and Guttorp.

11.
Modeling spatial interactions that arise in spatially referenced data is commonly done by incorporating the spatial dependence into the covariance structure either explicitly or implicitly via an autoregressive model. In the case of lattice (regional summary) data, two common autoregressive models used are the conditional autoregressive model (CAR) and the simultaneously autoregressive model (SAR). Both of these models produce spatial dependence in the covariance structure as a function of a neighbor matrix W and often a fixed unknown spatial correlation parameter. This paper examines in detail the correlation structures implied by these models as applied to an irregular lattice in an attempt to demonstrate their many counterintuitive or impractical results. A data example is used for illustration where US statewide average SAT verbal scores are modeled and examined for spatial structure using different spatial models.
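
Illustrative sketch (not taken from the abstract): the covariance implied by a neighbor matrix W can be computed directly for a tiny lattice. The code assumes a binary symmetric W, unit conditional variances for the CAR model (covariance (I - ρW)^{-1}) and unit error variances for the SAR model (covariance [(I - ρW)^T (I - ρW)]^{-1}). The four-region path graph, ρ = 0.5, and all function names are choices made for this example only.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def car_sar_covariances(w, rho):
    """Return (CAR covariance, SAR covariance) under the unit-variance assumptions above."""
    n = len(w)
    b = [[float(i == j) - rho * w[i][j] for j in range(n)] for i in range(n)]  # I - rho*W
    car = invert(b)
    b_t = [list(col) for col in zip(*b)]
    sar = invert(matmul(b_t, b))
    return car, sar


def correlation(cov, i, j):
    """Correlation between regions i and j implied by a covariance matrix."""
    return cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5


# Four regions in a row: 0 - 1 - 2 - 3.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
car_cov, sar_cov = car_sar_covariances(W, rho=0.5)
```

Even on this small lattice, the two end regions and the two interior regions get different marginal variances, and the pair (0, 1) gets a different correlation from the pair (1, 2), although a single ρ is used throughout.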

12.
In spatial epidemiology, detecting areas with a high ratio of disease is important as it may lead to identifying risk factors associated with disease. This in turn may lead to further epidemiological investigations into the nature of disease. Disease mapping studies have mostly considered only one disease in the estimated models. Simultaneous modelling of different diseases can also be a valuable tool from both the epidemiological and the statistical point of view. In particular, when we have several measurements recorded at each spatial location, one can consider multivariate models in order to handle the dependence among the multivariate components and the spatial dependence between locations. In this paper, spatial models that use multivariate conditionally autoregressive smoothing across the spatial dimension are considered. We study the patterns of incidence ratios and identify areas with consistently high ratio estimates as areas for further investigation. A hierarchical Bayesian approach using Markov chain Monte Carlo techniques is employed to simultaneously examine spatial trends of asthma visits by children and adults to hospital in the province of Manitoba, Canada, during 2000–2010.

13.
Markov chain Monte Carlo (MCMC) algorithms for Bayesian computation for Gaussian process-based models under default parameterisations are slow to converge due to the presence of spatial- and other-induced dependence structures. The main focus of this paper is to study the effect of the assumed spatial correlation structure on the convergence properties of the Gibbs sampler under the default non-centred parameterisation and a rival centred parameterisation (CP), for the mean structure of a general multi-process Gaussian spatial model. Our investigation finds answers to many pertinent, but as yet unanswered, questions on the choice between the two. Assuming the covariance parameters to be known, we compare the exact rates of convergence of the two by varying the strength of the spatial correlation, the level of covariance tapering, the scale of the spatially varying covariates, the number of data points, the number and the structure of block updating of the spatial effects and the amount of smoothness assumed in a Matérn covariance function. We also study the effects of introducing differing levels of geometric anisotropy in the spatial model. The case of unknown variance parameters is investigated using well-known MCMC convergence diagnostics. A simulation study and a real-data example on modelling air pollution levels in London are used for illustrations. A generic pattern emerges that the CP is preferable in the presence of more spatial correlation or more information obtained through, for example, additional data points or by increased covariate variability.

14.
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. In this paper, we propose a flexible rank-based nonparametric procedure for gene selection from microarray data. In the method we propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing a different variance for each gene. The contribution to this “single gene” statistic is the studentization of the empirical AUC, which takes into account the variances associated with each gene in the experiment. DeLong et al. proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to get a test statistic, and we focus on the primary step in the gene selection process, namely, the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the methods and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. The work includes how to use the variance information to produce a list of significant targets and assess differential gene expressions under two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed methods offer useful analytical tools for identifying differentially expressed genes for further biological and clinical analysis.
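
Illustrative sketch (not taken from the abstract): a studentized AUC for one gene can be written in a few lines using a DeLong-type variance estimate built from the per-observation placement values. This is one plausible reading of the statistic described above, not necessarily the authors' exact formula. The function name and the example expression values are hypothetical.

```python
import math


def studentized_auc(x, y):
    """Return (auc, z), where z = (auc - 0.5) / se and se uses a DeLong-type variance.

    x: expression values of one gene in group 1; y: values in group 2.
    Each group needs at least two observations.
    """
    def psi(a, b):
        # Pairwise comparison kernel: 1 if a > b, 0.5 for ties, 0 otherwise.
        return 1.0 if a > b else 0.5 if a == b else 0.0

    m, n = len(x), len(y)
    v10 = [sum(psi(a, b) for b in y) / n for a in x]  # placement of each x among the y's
    v01 = [sum(psi(a, b) for a in x) / m for b in y]  # placement of each y among the x's
    auc = sum(v10) / m
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    var = s10 / m + s01 / n
    if var == 0.0:
        raise ValueError("variance estimate is zero (e.g. perfectly separated groups)")
    return auc, (auc - 0.5) / math.sqrt(var)


# Hypothetical expression values for a single gene under two conditions.
auc, z = studentized_auc([3.0, 4.0, 5.0], [1.0, 2.0, 3.5])
```

Genes could then be ranked by the absolute value of `z`, so that a gene with a noisy AUC is ranked below a gene with the same AUC and a smaller estimated variance.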

15.
We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.

16.
Variable selection methods have been widely used in the analysis of high-dimensional data, for example, gene expression microarray data and single nucleotide polymorphism data. A special feature of the genomic data is that genes participating in a common metabolic pathway or sharing a similar biological function tend to have high correlations. The collinearity naturally embedded in these data requires special handling, which cannot be provided by existing variable selection methods. In this paper, we propose a set of new methods to select variables in correlated data. The new methods follow the forward selection procedure of least angle regression (LARS) but conduct grouping and selecting at the same time. The methods are designed in particular for the case where no prior information on the group structure of the data is available. Simulations and real examples show that our proposed methods often outperform the existing variable selection methods, including LARS and elastic net, in terms of both reducing prediction error and preserving sparsity of representation.

17.
Massively Parallel Signature Sequencing (MPSS) is a high-throughput counting-based technology available for gene expression profiling. It produces output that is similar to Serial Analysis of Gene Expression (SAGE) and is ideal for building complex relational databases for gene expression. Our goal is to compare the in vivo global gene expression profiles of tissues infected with different strains of Salmonella obtained using the MPSS technology. In this article, we develop an exact ANOVA type model for this count data using a zero-inflated Poisson (ZIP) distribution, different from existing methods that assume continuous densities. We adopt two Bayesian hierarchical models, one parametric and the other semiparametric with a Dirichlet process prior that has the ability to "borrow strength" across related signatures, where a signature is a specific arrangement of the nucleotides, usually 16-21 base-pairs long. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using non-parametric approaches, while controlling the false discovery rate. We identify several differentially expressed genes that have important biological significance and conclude with a summary of the biological discoveries.
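
Illustrative sketch (not taken from the abstract): the zero-inflated Poisson distribution mixes a point mass at zero (weight π) with an ordinary Poisson(λ) count (weight 1 - π). The snippet shows only this probability mass function, not the hierarchical ANOVA-type model built on it. The function name and the parameter values are assumptions for the example.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson with rate lam and extra-zero probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    # Extra zeros contribute only at k = 0; every count also gets the Poisson part.
    return (pi if k == 0 else 0.0) + (1.0 - pi) * poisson


# Hypothetical signature: 30% structural zeros, Poisson rate 2 otherwise.
p_zero = zip_pmf(0, lam=2.0, pi=0.3)
mean_count = sum(k * zip_pmf(k, lam=2.0, pi=0.3) for k in range(60))
```

With these values the probability of a zero count is 0.3 + 0.7·e^{-2}, noticeably larger than the e^{-2} of a plain Poisson(2), and the mean count is (1 - π)λ = 1.4.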

18.
Summary.  The importance of incorporating existing biological knowledge, such as gene functional annotations in gene ontology, in analysing high throughput genomic and proteomic data is being increasingly recognized. In the context of detecting differential gene expression, however, the current practice of using gene annotations is limited primarily to validations. Here we take a direct approach to incorporating gene annotations into mixture models for analysis. First, in contrast with a standard mixture model assuming that each gene of the genome has the same distribution, we study stratified mixture models allowing genes with different annotations to have different distributions, such as prior probabilities. Second, rather than treating parameters in stratified mixture models independently, we propose a hierarchical model to take advantage of the hierarchical structure of most gene annotation systems, such as gene ontology. We consider a simplified implementation for the proof of concept. An application to a mouse microarray data set and a simulation study demonstrate the improvement of the two new approaches over the standard mixture model.

19.
The microarray technology allows the measurement of expression levels of thousands of genes simultaneously. The dimension and complexity of gene expression data obtained by microarrays create challenging data analysis and management problems ranging from the analysis of images produced by microarray experiments to biological interpretation of results. Therefore, statistical and computational approaches are beginning to assume a substantial position within the molecular biology area. We consider the problem of simultaneously clustering genes and tissue samples (more generally, conditions) of a microarray data set. This can be useful for revealing groups of genes involved in the same molecular process as well as groups of conditions where this process takes place. The need to find a subset of genes and tissue samples defining a homogeneous block has led to the application of double clustering techniques on gene expression data. Here, we focus on an extension of standard K-means to simultaneously cluster observations and features of a data matrix, namely double K-means introduced by Vichi (2000). We introduce this model in a probabilistic framework and discuss the advantages of using this approach. We also develop a coordinate ascent algorithm and test its performance via simulation studies and a real data set. Finally, we validate the results obtained on the real data set by building resampling confidence intervals for block centroids.
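
Illustrative sketch (not taken from the abstract): when rows (genes) and columns (conditions) are clustered at the same time, each pair of a row cluster and a column cluster defines a block with its own centroid. The code below assumes a least-squares block criterion, computes block centroids for given labels and evaluates the within-block sum of squares. It is not the probabilistic model or the coordinate ascent algorithm developed by the authors, and all names and data are hypothetical.

```python
def block_centroids(data, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Mean of the entries falling in each (row cluster, column cluster) block."""
    sums = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    counts = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    # Empty blocks have no centroid; mark them with NaN.
    return [[s / c if c else float("nan") for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(sums, counts)]


def block_loss(data, row_labels, col_labels, centroids):
    """Sum of squared deviations of every entry from the centroid of its block."""
    return sum(
        (value - centroids[row_labels[i]][col_labels[j]]) ** 2
        for i, row in enumerate(data)
        for j, value in enumerate(row)
    )


# Hypothetical 4 genes x 4 conditions matrix with a clean 2 x 2 block structure.
data = [[1, 1, 5, 5], [1, 1, 5, 5], [9, 9, 3, 3], [9, 9, 3, 3]]
good = block_centroids(data, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = block_centroids(data, [0, 0, 1, 1], [0, 1, 0, 1], 2, 2)
```

With the correct column labels the loss is zero; with the scrambled column labels the blocks mix high and low values and the loss increases, which is the quantity an alternating update of row labels, column labels and centroids would try to reduce.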

20.