Similar Articles
20 similar articles found (search time: 31 ms)
1.
In this paper, we study the change-point inference problem motivated by genomic data collected for the purpose of monitoring DNA copy number changes. DNA copy number changes, or copy number variations (CNVs), correspond to chromosomal aberrations and signify abnormality of a cell. Cancer development and other related diseases are usually associated with DNA copy number changes on the genome. Such data contain inherent random noise, so an appropriate statistical model is needed to identify statistically significant DNA copy number changes. This type of statistical inference is crucial in cancer research, clinical diagnostic applications, and other related genomic research. For the high-throughput genomic data resulting from DNA copy number experiments, a mean and variance change point model (MVCM) is appropriate for detecting CNVs. We propose a Bayesian approach to the MVCM for the case of a single change and use a sliding window to search for all CNVs on a given chromosome. We carry out simulation studies to evaluate the estimate of the locus of the DNA copy number change using the derived posterior probability. These simulation results show that the approach is suitable for identifying copy number changes. The approach is also illustrated on several chromosomes from nine fibroblast cancer cell lines (array-based comparative genomic hybridization data). All DNA copy number aberrations that have been identified and verified by karyotyping are detected by our approach on these cell lines.
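The single-change search described above can be caricatured in a few lines: score every candidate change point by the Gaussian log-likelihood of the two segments it induces, allowing both the mean and the variance to differ across segments, as in an MVCM. This is only an illustrative maximum-likelihood stand-in for the paper's posterior-probability criterion; the simulated data and the `change_point_scores` helper are invented for the example.

```python
import math
import random

def segment_loglik(x):
    # Gaussian log-likelihood of one segment, with MLE mean and variance
    n = len(x)
    m = sum(x) / n
    v = max(sum((xi - m) ** 2 for xi in x) / n, 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * v) + 1.0)

def change_point_scores(x):
    # score each split k: independent (mean, variance) in x[:k] and x[k:]
    return {k: segment_loglik(x[:k]) + segment_loglik(x[k:])
            for k in range(2, len(x) - 1)}

rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(50)] +
        [rng.gauss(2.0, 0.5) for _ in range(50)])   # change planted at index 50
scores = change_point_scores(data)
k_hat = max(scores, key=scores.get)                 # estimated change locus
```

Sliding this scorer along a chromosome in overlapping windows mimics the multiple-CNV search the abstract describes.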

2.
Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about DNA copy number. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify copy number gains and losses on statistical grounds, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on a hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are also detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
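The copy-number HMM idea can be sketched with a fixed three-state model (loss / normal / gain) and a Viterbi decode of the intensity-ratio sequence. This is deliberately simpler than the paper's fully Bayesian Metropolis-within-Gibbs treatment: the state means, emission standard deviation, and sticky transition probability below are invented illustrative values, not fitted quantities.

```python
import math
import random

STATES = ("loss", "normal", "gain")
MEAN = {"loss": -0.6, "normal": 0.0, "gain": 0.6}   # illustrative log2-ratio levels
SD = 0.2                                            # illustrative emission noise
STAY = 0.98                                         # sticky self-transition prob

def emit_logp(x, s):
    return -0.5 * ((x - MEAN[s]) / SD) ** 2 - math.log(SD * math.sqrt(2 * math.pi))

def trans_logp(a, b):
    p = STAY if a == b else (1.0 - STAY) / (len(STATES) - 1)
    return math.log(p)

def viterbi(obs):
    # most probable copy-number state path under the fixed HMM
    V = [{s: emit_logp(obs[0], s) for s in STATES}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda a: V[-1][a] + trans_logp(a, s))
            row[s] = V[-1][best] + trans_logp(best, s) + emit_logp(x, s)
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

rng = random.Random(7)
obs = ([rng.gauss(0.0, SD) for _ in range(30)] +     # normal copy number
       [rng.gauss(0.6, SD) for _ in range(20)] +     # simulated gain segment
       [rng.gauss(0.0, SD) for _ in range(30)])
path = viterbi(obs)
```

The sticky self-transition is what lets the model recover extended regions of altered copy number rather than flagging isolated probes.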

3.
The hidden Markov model (HMM) provides an attractive framework for modeling long-term persistence in a variety of applications, including pattern recognition. Unlike typical mixture models, hidden Markov states can represent the heterogeneity in data, and the model can be extended to the multivariate case using a hierarchical Bayesian approach. This article provides a nonparametric Bayesian modeling approach to the multi-site HMM by placing stick-breaking priors on each row of an infinite state transition matrix. This extension has many advantages over a parametric HMM. For example, it provides more flexible information for identifying the structure of the HMM, such as the number of states, than a parametric analysis does. We use a simulation example and a real dataset to evaluate the proposed approach.
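The stick-breaking prior mentioned above has a direct constructive form: each row of the (truncated) infinite transition matrix gets weights w_k = v_k ∏_{j&lt;k}(1 − v_j) with v_k ~ Beta(1, α). A minimal sketch follows; the concentration α = 2 and the truncation at 50 atoms are arbitrary illustrative choices, not values from the article.

```python
import random

def stick_breaking(alpha, n_atoms, rng):
    # w_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha);
    # truncated after n_atoms sticks (the leftover mass is negligible
    # once enough atoms are drawn)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
w = stick_breaking(2.0, 50, rng)   # one prior draw of a transition-matrix row
```

Drawing one such vector per row yields a random transition matrix whose effective number of visited states adapts to the data, which is how the nonparametric extension sidesteps fixing the state count in advance.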

4.
Finite memory sources and variable-length Markov chains have recently gained popularity in data compression and mining, in particular for applications in bioinformatics and language modelling. Here, we consider denser data compression and prediction with a family of sparse Bayesian predictive models for Markov chains in finite state spaces. Our approach lumps transition probabilities into classes composed of invariant probabilities, such that the resulting models need not have a hierarchical structure as in context tree-based approaches. This can lead to a substantially higher rate of data compression, and such non-hierarchical sparse models can be motivated, for instance, by data dependence structures existing in the bioinformatics context. We describe a Bayesian inference algorithm for learning sparse Markov models through clustering of transition probabilities. Experiments with DNA sequence and protein data show that our approach is competitive in both prediction and classification when compared with several alternative methods on the basis of variable memory length.
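The "lumping" of transition probabilities can be illustrated with a crude deterministic stand-in: merge rows of a transition matrix whose distributions are close in total variation distance, so several contexts share one lumped distribution. The tolerance and the greedy first-fit rule below are invented for the example; the paper's actual method is a Bayesian clustering algorithm, not this heuristic.

```python
def tv_distance(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def lump_rows(P, tol=0.05):
    # greedily assign each transition row to the first existing class
    # whose representative row is within tol; otherwise open a new class
    classes = []   # representative rows, one per lumped class
    labels = []    # class index for each original row
    for row in P:
        for i, rep in enumerate(classes):
            if tv_distance(row, rep) <= tol:
                labels.append(i)
                break
        else:
            classes.append(row)
            labels.append(len(classes) - 1)
    return labels, classes

# four contexts, three of which behave almost identically
P = [[0.70, 0.20, 0.10],
     [0.69, 0.21, 0.10],
     [0.10, 0.10, 0.80],
     [0.72, 0.18, 0.10]]
labels, classes = lump_rows(P)
```

With only two lumped classes instead of four full rows, fewer free parameters remain, which is the source of the compression gain the abstract describes.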

5.
In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, as well as concordant and discordant differential expression across studies. We evaluated the performance of our model in a comprehensive fashion, using both artificial data and a "split-study" validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of the Bayesian model. The 1 - AUC values for the Bayesian model are roughly half of the corresponding values for a direct combination of t- and SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most notably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of platform-, sample-, and annotation-differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.

6.
We propose a general Bayesian joint modeling approach for mixed longitudinal outcomes from the exponential family that takes into account any differential misclassification that may exist among the categorical outcomes. Under this framework, outcomes observed without measurement error are related to latent trait variables through generalized linear mixed effect models. The misclassified outcomes are related to latent class variables, which represent unobserved real states, using mixed hidden Markov models (MHMMs). In addition to enabling the estimation of prevalence, transition, and misclassification probabilities, MHMMs capture cluster-level heterogeneity. A transition modeling structure allows the latent trait and latent class variables to depend on observed predictors at the same time period and also on the latent trait and latent class variables at previous time periods for each individual. Simulation studies are conducted to make comparisons with traditional models and to illustrate the gains from the proposed approach. The new approach is applied to data from the Southern California Children's Health Study to jointly model questionnaire-based asthma state and multiple lung function measurements, in order to gain better insight into the underlying biological mechanism that governs the inter-relationship between asthma state and lung function development.

7.
Hidden Markov models (HMMs) have been shown to be a flexible tool for modelling complex biological processes. However, choosing the number of hidden states remains an open question, and the inclusion of random effects also deserves more research, as it is a recent addition to the fixed-effect HMM in many application fields. We present a Bayesian mixed HMM with an unknown number of hidden states and fixed covariates. The model is fitted using reversible-jump Markov chain Monte Carlo, avoiding the need to pre-select the number of hidden states. We show through simulations that the estimates produced are more precise than those from a fixed-effect HMM, and we illustrate the model's practical application to the analysis of DNA copy number data, a field where HMMs are widely used.

8.
Hierarchical models are rather common in uncertainty theory. They arise when there is a ‘correct’ or ‘ideal’ (the so-called first-order) uncertainty model about a phenomenon of interest, but the modeler is uncertain about what it is. The modeler's uncertainty is then called second-order uncertainty. For most of the hierarchical models in the literature, both the first- and the second-order models are precise, i.e., they are based on classical probabilities. In the present paper, I propose a specific hierarchical model that is imprecise at the second level, which means that at this level, lower probabilities are used. No restrictions are imposed on the underlying first-order model: that is allowed to be either precise or imprecise. I argue that this type of hierarchical model generalizes and includes a number of existing uncertainty models, such as imprecise probabilities, Bayesian models, and fuzzy probabilities. The main result of the paper is what I call precision–imprecision equivalence: the implications of the model for decision making and statistical reasoning are the same, whether the underlying first-order model is assumed to be precise or imprecise.

9.
Assessing the selective influence of amino acid properties is important in understanding evolution at the molecular level. A collection of methods and models has been developed in recent years to determine if amino acid sites in a given DNA sequence alignment display substitutions that are altering or conserving a prespecified set of amino acid properties. Residues showing an elevated number of substitutions that favorably alter a physicochemical property are considered targets of positive natural selection. Such approaches usually perform independent analyses for each amino acid property under consideration, without taking into account the fact that some of the properties may be highly correlated. We propose a Bayesian hierarchical regression model with latent factor structure that allows us to determine which sites display substitutions that conserve or radically change a set of amino acid properties, while accounting for the correlation structure that may be present across such properties. We illustrate our approach by analyzing simulated data sets and an alignment of lysin sperm DNA.

10.
Typical joint modeling of longitudinal measurements and time-to-event data assumes that the two models share a common set of random effects with a normal distribution assumption. However, the underlying population from which the sample is drawn is sometimes heterogeneous, and detecting homogeneous subsamples of it is an important scientific question. In this paper, a finite mixture of normal distributions for the shared random effects is proposed to account for heterogeneity in the population. To detect whether unobserved heterogeneity exists, we use a simple graphical exploratory diagnostic tool proposed by Verbeke and Molenberghs [34] to assess whether the traditional normality assumption for the random effects in the mixed model is adequate. In the joint modeling setting, in the case of evidence against normality (homogeneity), a finite mixture of normals is used for the shared random-effects distribution. A Bayesian MCMC procedure is developed for parameter estimation and inference. The methodology is illustrated using simulation studies. The proposed approach is also used to analyze a real HIV data set; using the heterogeneous joint model for this data set, the individuals are classified into two groups: a high-risk group and a moderate-risk group.

11.
The development of new technologies to measure gene expression has called for statistical methods to integrate findings across multiple-platform studies. A common goal of microarray analysis is to identify genes with differential expression between two conditions, such as treatment versus control. Here, we introduce a hierarchical Bayesian meta-analysis model to pool gene expression studies from different microarray platforms: spotted DNA arrays and short oligonucleotide arrays. The studies have different array design layouts, each with multiple sources of data replication, including repeated experiments, slides, and probes. Our model produces the gene-specific posterior probability of differential expression, which is the basis for inference. In simulations combining two and five independent studies, our meta-analysis model outperformed separate analyses for three commonly used comparison measures; it also showed improved receiver operating characteristic curves. When combining spotted DNA and CombiMatrix short oligonucleotide array studies of Geobacter sulfurreducens, our meta-analysis model discovered more genes for fixed thresholds of the posterior probability of differential expression and the Bayesian false discovery rate than individual study analyses. We also examine an alternative model and compare the models using the deviance information criterion.

12.
Cluster analysis is one of the most widely used methods in statistical analysis, in which homogeneous subgroups are identified in a heterogeneous population. Because many applications involve mixed continuous and discrete data, ordinary clustering methods such as hierarchical methods, k-means, and model-based methods have been extended for the analysis of mixed data. However, in the available model-based clustering methods, as the number of continuous variables increases, the number of parameters grows, and identifying and fitting an appropriate model may become difficult. In this paper, to reduce the number of parameters in model-based clustering of mixed continuous (normal) and nominal data, a set of parsimonious models is introduced. The models in this set use the general location model approach for the distribution of the mixed variables and a factor-analyzer structure for the covariance matrices. The ECM algorithm is used to estimate the parameters of these models. The performance of the proposed models for clustering is shown through simulation studies and the analysis of two real data sets.

13.
Empirical Bayes estimates of the local false discovery rate can reflect uncertainty about the estimated prior by supplementing their Bayesian posterior probabilities with confidence levels as posterior probabilities. This use of coherent fiducial inference with hierarchical models generates set estimators that propagate uncertainty to varying degrees. Some of the set estimates approach estimates from plug-in empirical Bayes methods for high numbers of comparisons, and they can come close to the usual confidence sets given a sufficiently low number of comparisons.

14.
This work examines the problem of locating changes in the distribution of a compound Poisson process, where the variables being summed are iid normal and the number of variables follows a Poisson distribution. A Bayesian approach is developed to identify the location of significant changes in any of the parameters of the distribution, and a sliding window algorithm is used to identify multiple change points. These results can be applied in any field of study where there is interest in locating changes not only in the parameters of a normally distributed data set but also in the rate of their occurrence. The method has direct application to the study of DNA copy number variations in cancer research, where it is known that the distances between genes can affect their intensity levels.
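One ingredient of the compound-Poisson change model, a shift in the Poisson rate of the counts, can be sketched by scoring each candidate split with the Poisson log-likelihood of the two count segments. The simulation parameters and the helper functions are invented for the example; the paper's Bayesian treatment also covers changes in the normal summands, which this sketch omits.

```python
import math
import random

def poisson_draw(lam, rng):
    # Knuth's multiplication method for a Poisson(lam) draw
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_seg_loglik(counts):
    # Poisson log-likelihood of one segment at its MLE rate
    lam = max(sum(counts) / len(counts), 1e-9)
    return sum(-lam + c * math.log(lam) - math.lgamma(c + 1) for c in counts)

def rate_change_scores(counts):
    # score each split k: independent rates in counts[:k] and counts[k:]
    return {k: poisson_seg_loglik(counts[:k]) + poisson_seg_loglik(counts[k:])
            for k in range(2, len(counts) - 1)}

rng = random.Random(3)
counts = ([poisson_draw(2.0, rng) for _ in range(40)] +
          [poisson_draw(6.0, rng) for _ in range(40)])  # rate change at index 40
scores = rate_change_scores(counts)
k_hat = max(scores, key=scores.get)
```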

15.
We propose a Bayesian hierarchical model for multiple comparisons in mixed models, where the repeated measures on subjects are described with subject random effects. The model facilitates inference by parameterizing the successive differences of the population means; for these differences, we choose independent prior distributions that are mixtures of a normal distribution and a discrete distribution with its entire mass at zero. For the other parameters, we choose conjugate or vague priors. The performance of the proposed hierarchical model is investigated on simulated and two real data sets, and the results illustrate that the proposed model can effectively conduct a global test and pairwise comparisons using the posterior probability that any two means are equal. A simulation study is performed to analyze the type I error rate, the familywise error rate, and the test power. A Gibbs sampler procedure is used to estimate the parameters and to calculate the posterior probabilities.
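The mixture prior described above, a point mass at zero plus a normal slab on each successive mean difference, gives a closed-form posterior probability that a difference is exactly zero when a single observed difference has known sampling variance. The variances and the 50/50 prior weight below are illustrative choices, not settings from the paper.

```python
import math

def normal_pdf(y, var):
    # density of N(0, var) at y
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def prob_null(y, s2=1.0, tau2=4.0, p0=0.5):
    # posterior probability that the mean difference is exactly zero:
    #   spike: y ~ N(0, s2)           (difference is truly zero)
    #   slab:  y ~ N(0, s2 + tau2)    (difference drawn from N(0, tau2))
    spike = p0 * normal_pdf(y, s2)
    slab = (1.0 - p0) * normal_pdf(y, s2 + tau2)
    return spike / (spike + slab)
```

A difference observed near zero pushes this probability above one half, while a large observed difference drives it toward zero, which is exactly the behavior that lets the posterior serve as both a global test and a pairwise comparison.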

16.
Individual-level models (ILMs) for infectious disease can be used to model disease spread between individuals while taking into account important covariates. One important covariate in determining the risk of infection transfer can be spatial location. At the same time, measurement error is a concern in many areas of statistical analysis, and infectious disease modelling is no exception. In this paper, we are concerned with the issue of measurement error in the recorded locations of individuals when using a simple spatial ILM to model the spread of disease within a population. An ILM that incorporates spatial location random effects is introduced within a hierarchical Bayesian framework. This model is tested on both simulated data and data from the UK 2001 foot-and-mouth disease epidemic, and its ability to successfully identify both the spatial infection kernel and the basic reproduction number (R0) of the disease is assessed.
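The core of a simple spatial ILM is the per-time-step infection probability for a susceptible individual, driven by a distance kernel summed over the currently infectious individuals. The power-law kernel and its parameters below are a common illustrative choice for this model class, not the kernel fitted in the paper.

```python
import math

def infection_prob(sus_xy, infectious_xy, alpha=1.0, beta=2.0):
    # ILM-style probability of infection in one time step:
    #   P = 1 - exp(-alpha * sum_j d_ij^(-beta))
    # where d_ij is the distance from susceptible i to infectious j
    rate = 0.0
    for (x, y) in infectious_xy:
        d = math.hypot(sus_xy[0] - x, sus_xy[1] - y)
        rate += alpha * d ** (-beta)
    return 1.0 - math.exp(-rate)

# one infectious individual at the origin; risk decays with distance
p_near = infection_prob((1.0, 0.0), [(0.0, 0.0)])
p_far = infection_prob((3.0, 0.0), [(0.0, 0.0)])
```

Because the kernel depends on the recorded coordinates, jitter in those coordinates propagates directly into the infection probabilities, which is why location measurement error matters for this model.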

17.
Imprisonment levels vary widely across the United States, with some state imprisonment rates six times higher than others. Imposition of prison sentences also varies between counties within states, with previous research suggesting that covariates such as crime rate, unemployment level, racial composition, political conservatism, geographic region, and sentencing policies account for some of this variation. Other studies, using court data on individual felons, demonstrate how type of offense, demographics, criminal history, and case characteristics affect sentence severity. This article considers the effects of both county-level and individual-level covariates on whether a convicted felon receives a prison sentence rather than a jail or non-custodial sentence. We analyze felony court case processing data from May 1998 for 39 of the nation's most populous urban counties using a Bayesian hierarchical logistic regression model. By adopting a Bayesian approach, we are able to overcome a number of challenges. The model allows individual-level effects to vary by county, but relates these effects across counties using county-level covariates. We account for missing data using imputation via additional Gibbs sampling steps when estimating the model. Finally, we use posterior samples to construct novel predictor effect plots to aid communication of results to criminal justice policy-makers.

18.
Integro-difference equations (IDEs) provide a flexible framework for dynamic modeling of spatio-temporal data. The choice of kernel in an IDE model relates directly to the underlying physical process being modeled, and it can affect model fit and predictive accuracy. We introduce Bayesian nonparametric methods to the IDE literature as a means to allow flexibility in modeling the kernel. We propose a mixture of normal distributions for the IDE kernel, built from a spatial Dirichlet process for the mixing distribution, which can model kernels whose shapes change with location. This allows the IDE model to capture non-stationarity with respect to location and to reflect a changing physical process across the domain. We address computational concerns for inference by leveraging Hermite polynomials as a basis for the representation of the process and the IDE kernel, and we incorporate Hamiltonian Markov chain Monte Carlo steps in the posterior simulation method. An example with synthetic data demonstrates that the model can successfully capture location-dependent dynamics. Moreover, using a data set of ozone pressure, we show that the spatial Dirichlet process mixture model outperforms several alternative models for the IDE kernel, including the state of the art in the IDE literature, that is, a Gaussian kernel with location-dependent parameters.
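Stripped of the nonparametric machinery, one step of an integro-difference equation is just a kernel-weighted integral of the current surface; on a one-dimensional grid it discretizes to the sum below. The fixed-bandwidth Gaussian kernel here stands in for the paper's spatially varying Dirichlet-process mixture kernel, and the grid and pulse are invented for illustration.

```python
import math

def normal_kernel(s, r, sigma=0.5):
    # Gaussian redistribution kernel k(s, r)
    return math.exp(-(s - r) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def ide_step(u, grid, sigma=0.5):
    # u_{t+1}(s_i) = sum_j k(s_i, r_j) u_t(r_j) dr   (Riemann discretization
    # of the IDE integral over the spatial domain)
    dr = grid[1] - grid[0]
    return [sum(normal_kernel(s, r, sigma) * u_r for r, u_r in zip(grid, u)) * dr
            for s in grid]

grid = [i * 0.1 - 5.0 for i in range(101)]   # regular grid on [-5, 5]
u0 = [0.0] * 101
u0[50] = 1.0                                 # unit pulse at the grid center
u1 = ide_step(u0, grid)

mass0 = sum(u0) * 0.1                        # discrete integral of u_t
mass1 = sum(u1) * 0.1
```

One step spreads the pulse while (approximately) conserving its integral; letting the kernel's shape vary with s is what the spatial Dirichlet process mixture adds.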

19.
This article considers multinomial data subject to misclassification in the presence of covariates which affect both the misclassification probabilities and the true classification probabilities. A subset of the data may be subject to a secondary measurement according to an infallible classifier. Computations are carried out in a Bayesian setting where it is seen that the prior has an important role in driving the inference. In addition, a new and less problematic definition of nonidentifiability is introduced and is referred to as hierarchical nonidentifiability.

20.
We present a Bayesian analysis framework for matrix-variate normal data with dependency structures induced by rows and columns. This framework of matrix normal models includes prior specifications, posterior computation using Markov chain Monte Carlo methods, evaluation of prediction uncertainty, model structure search, and extensions to multidimensional arrays. Compared with Bayesian probabilistic matrix factorization, which places a Gaussian prior on a single row of the data matrix, our proposed model, Bayesian hierarchical kernelized probabilistic matrix factorization, imposes Gaussian process priors over multiple rows of the matrix. The learned model therefore explicitly captures the underlying correlation among the rows and the columns. In addition, our method requires no specific assumptions such as the independence of latent factors for rows and columns, which gives it more flexibility for modeling real data than existing work. Finally, the proposed framework can be adapted to a wide range of applications, including multivariate analysis, time series, and spatial modeling. Experiments highlight the superiority of the proposed model in handling model uncertainty and model optimization.
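The matrix-variate normal at the heart of such models can be sampled directly: if Σ_row = AAᵀ and Σ_col = BBᵀ are Cholesky factorizations, then Y = M + A Z Bᵀ with Z a matrix of iid standard normals has the desired row and column covariances. A minimal pure-Python sketch follows; the small identity-covariance example is invented for illustration and is not tied to the paper's experiments.

```python
import random

def cholesky(S):
    # lower-triangular L with L L^T = S, for a small SPD matrix S
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = ((S[i][i] - s) ** 0.5 if i == j
                       else (S[i][j] - s) / L[j][j])
    return L

def matrix_normal_draw(M, Srow, Scol, rng):
    # Y = M + A Z B^T, where Srow = A A^T, Scol = B B^T, Z iid N(0, 1)
    A, B = cholesky(Srow), cholesky(Scol)
    n, p = len(M), len(M[0])
    Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    AZ = [[sum(A[i][k] * Z[k][j] for k in range(n)) for j in range(p)]
          for i in range(n)]
    return [[M[i][j] + sum(AZ[i][k] * B[j][k] for k in range(p))
             for j in range(p)] for i in range(n)]

rng = random.Random(0)
Y = matrix_normal_draw([[0.0, 0.0], [0.0, 0.0]],
                       [[1.0, 0.0], [0.0, 1.0]],   # row covariance (identity)
                       [[1.0, 0.0], [0.0, 1.0]],   # column covariance (identity)
                       rng)
```

Replacing the identity covariances with Gaussian-process kernel matrices over row and column indices gives the kernelized structure the abstract describes.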


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号