Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
We revisit the problem of estimating the proportion π of true null hypotheses when a large number of parallel hypothesis tests are performed independently. While the proportion is a quantity of interest in its own right in applications, the problem has arisen in assessing or controlling an overall false discovery rate. On the basis of a Bayes interpretation of the problem, the marginal distribution of the p-value is modeled as a mixture of the uniform distribution (null) and a non-uniform distribution (alternative), so that the parameter π of interest is characterized as the mixing proportion of the uniform component in the mixture. In this article, a nonparametric exponential mixture model is proposed to fit the p-values. As an alternative approach to the convex decreasing mixture model, the exponential mixture model has the advantages of identifiability, flexibility, and regularity. A computation algorithm is developed. The new approach is applied to a leukemia gene expression data set where multiple significance tests over 3,051 genes are performed. The new estimate for π with the leukemia gene expression data appears to be about 10% lower than the other three estimates that are known to be conservative. Simulation results also show that the new estimate is usually lower and has smaller bias than the other three estimates.
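The two-component view of p-values described in this abstract can be illustrated with a short sketch. The code below is not the exponential mixture estimator proposed in the article; it assumes the simpler, widely used tail-proportion estimate π̂(λ) = #{p_i > λ} / (m(1 − λ)), which is conservative because some alternative p-values also fall above λ. The function names, the threshold `lam`, and the Beta(0.2, 1) alternative used for simulation are illustrative assumptions.

```python
import random


def estimate_null_proportion(p_values, lam=0.5):
    """Tail-proportion estimate of the share of true nulls.

    Null p-values are Uniform(0, 1), so about pi * (1 - lam) * m of them
    should exceed lam; dividing the observed count by m * (1 - lam)
    gives an (upward-biased) estimate of pi.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def simulate_p_values(m, pi0, rng):
    """Draw m p-values from the mixture pi0 * Uniform + (1 - pi0) * Beta(0.2, 1).

    The Beta(0.2, 1) alternative is an assumption chosen only because it
    concentrates mass near zero, as p-values under an alternative do.
    """
    return [
        rng.random() if rng.random() < pi0 else rng.betavariate(0.2, 1.0)
        for _ in range(m)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    p = simulate_p_values(20000, 0.8, rng)
    print(round(estimate_null_proportion(p), 3))
```

On data simulated this way with a true null proportion of 0.8, the estimate tends to land somewhat above 0.8, which is the kind of conservatism the abstract attributes to existing estimates.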

2.
A Bayesian mixture model for differential gene expression   Total citations: 3 (self-citations: 0; citations by others: 3)
Summary.  We propose model-based inference for differential gene expression, using a nonparametric Bayesian probability model for the distribution of gene intensities under various conditions. The probability model is a mixture of normal distributions. The resulting inference is similar to a popular empirical Bayes approach that is used for the same inference problem. The use of fully model-based inference mitigates some of the necessary limitations of the empirical Bayes method. We argue that inference is no more difficult than posterior simulation in traditional nonparametric mixture-of-normal models. The approach proposed is motivated by a microarray experiment that was carried out to identify genes that are differentially expressed between normal tissue and colon cancer tissue samples. Additionally, we carried out a small simulation study to verify the methods proposed. In the motivating case-studies we show how the nonparametric Bayes approach facilitates the evaluation of posterior expected false discovery rates. We also show how inference can proceed even in the absence of a null sample of known non-differentially expressed scores. This highlights the difference from alternative empirical Bayes approaches that are based on plug-in estimates.

3.
The development of new technologies to measure gene expression calls for statistical methods to integrate findings across multiple-platform studies. A common goal of microarray analysis is to identify genes with differential expression between two conditions, such as treatment versus control. Here, we introduce a hierarchical Bayesian meta-analysis model to pool gene expression studies from different microarray platforms: spotted DNA arrays and short oligonucleotide arrays. The studies have different array design layouts, each with multiple sources of data replication, including repeated experiments, slides and probes. Our model produces the gene-specific posterior probability of differential expression, which is the basis for inference. In simulations combining two and five independent studies, our meta-analysis model outperformed separate analyses for three commonly used comparison measures; it also showed improved receiver operating characteristic curves. When combining spotted DNA and CombiMatrix short oligonucleotide array studies of Geobacter sulfurreducens, our meta-analysis model discovered more genes for fixed thresholds of posterior probability of differential expression and Bayesian false discovery than individual study analyses. We also examine an alternative model and compare models using the deviance information criterion.

4.
Identifying differentially expressed genes is a basic objective in microarray experiments. Many statistical methods for detecting differentially expressed genes in multiple-slide experiments have been proposed. However, with limited experimental resources, sometimes only a single cDNA array or two oligonucleotide arrays can be made, or too few replicated arrays can be run. Many current statistical models cannot be used because replicated data are not available. Simply using fold changes is also unreliable and inefficient [Chen et al. 1997. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics 2, 364–374; Newton et al. 2001. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8, 37–52; Pan et al. 2002. How many replicates of arrays are required to detect gene expression changes in microarray experiments? a mixture model approach. Genome Biol. 3, research0022.1-0022.10]. We propose a new method. If the log-transformed ratios for the expressed genes as well as unexpressed genes have equal variance, we use a Hadamard matrix to construct a t-test from single-array data. Basically, we test whether each doubtful gene has significantly differential expression compared to the unexpressed genes. We form some new random variables corresponding to the rows of a Hadamard matrix using the algebraic sum of gene expressions. A one-sample t-test is constructed and the p-value is calculated for each doubtful gene based on these random variables. By using any method for multiple testing, adjusted p-values can be obtained from the original p-values and the significance of doubtful genes can be determined.
When the variance of expressed genes differs from the variance of unexpressed genes, we construct a z-statistic based on the result from application of the Hadamard matrix and find the confidence interval to retain the null hypothesis. Using the interval, we determine differentially expressed genes. This method is also useful for multiple microarrays, especially when sufficient replicated data are not available for a traditional t-test. We apply our methodology to ApoAI data. The results appear to be promising. They not only confirm the previously known differentially expressed genes, but also indicate more genes to be differentially expressed.

5.
Summary.  In microarray experiments, accurate estimation of the gene variance is a key step in the identification of differentially expressed genes. Variance models range from the overly stringent homoscedastic assumption to the overparameterized model that assumes a specific variance for each gene. Between these two extremes there is some room for intermediate models. We propose a method that identifies clusters of genes with equal variance. We use a mixture model on the gene variance distribution. A test statistic for ranking and detecting differentially expressed genes is proposed. The method is illustrated with publicly available complementary deoxyribonucleic acid microarray experiments, an unpublished data set and further simulation studies.

6.
ABSTRACT

We introduce a new parsimonious bimodal distribution, referred to as the bimodal skew-symmetric Normal (BSSN) distribution, which is potentially effective in capturing bimodality, excess kurtosis, and skewness. Explicit expressions for the moment-generating function, mean, variance, skewness, and excess kurtosis were derived. The shape properties of the proposed distribution were investigated in regard to skewness, kurtosis, and bimodality. Maximum likelihood estimation was considered and an expression for the observed information matrix was provided. Illustrative examples using medical and financial data as well as simulated data from a mixture of normal distributions were worked through.

7.
Summary.  The importance of incorporating existing biological knowledge, such as gene functional annotations in gene ontology, in analysing high throughput genomic and proteomic data is being increasingly recognized. In the context of detecting differential gene expression, however, the current practice of using gene annotations is limited primarily to validations. Here we take a direct approach to incorporating gene annotations into mixture models for analysis. First, in contrast with a standard mixture model assuming that each gene of the genome has the same distribution, we study stratified mixture models allowing genes with different annotations to have different distributions, such as prior probabilities. Second, rather than treating parameters in stratified mixture models independently, we propose a hierarchical model to take advantage of the hierarchical structure of most gene annotation systems, such as gene ontology. We consider a simplified implementation for the proof of concept. An application to a mouse microarray data set and a simulation study demonstrate the improvement of the two new approaches over the standard mixture model.

8.
Summary. An advantage of randomization tests for small samples is that an exact P-value can be computed under an additive model. A disadvantage with very small sample sizes is that the resulting discrete distribution for P-values can make it mathematically impossible for a P-value to attain a particular degree of significance. We investigate a distribution of P-values that arises when several thousand randomization tests are conducted simultaneously using small samples, a situation that arises with microarray gene expression data. We show that the distribution yields valuable information regarding groups of genes that are differentially expressed between two groups: a treatment group and a control group. This distribution helps to categorize genes with varying degrees of overlap of genetic expression values between the two groups, and it helps to quantify the degree of overlap by using the P-value from a randomization test. Moreover, a statistical test is available that compares the actual distribution of P-values with an expected distribution if there are no genes that are differentially expressed. We demonstrate the method and illustrate the results by using a microarray data set involving a cell line for rheumatoid arthritis. A small simulation study evaluates the effect that correlated gene expression levels could have on results from the analysis.
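The discreteness of exact P-values mentioned in this abstract can be seen in a small sketch. It assumes a two-sided randomization test on the absolute difference in group means and enumerates every relabeling; the function name and the choice of test statistic are illustrative assumptions, not details taken from the article.

```python
from itertools import combinations
from math import comb


def exact_randomization_p_value(treatment, control):
    """Two-sided exact P-value for the absolute difference in means.

    Every way of choosing which pooled observations are labelled
    'treatment' is enumerated, so the P-value is a multiple of
    1 / C(n, k), where n is the pooled size and k the treatment size.
    """
    pooled = list(treatment) + list(control)
    n, k = len(pooled), len(treatment)
    if k == 0 or k == n:
        raise ValueError("both groups must be non-empty")
    total = sum(pooled)

    def abs_mean_diff(sum_treated):
        return abs(sum_treated / k - (total - sum_treated) / (n - k))

    observed = abs_mean_diff(sum(treatment))
    extreme = 0
    for idx in combinations(range(n), k):
        relabelled_sum = sum(pooled[i] for i in idx)
        # Small tolerance guards against floating-point ties.
        if abs_mean_diff(relabelled_sum) >= observed - 1e-12:
            extreme += 1
    return extreme / comb(n, k)


if __name__ == "__main__":
    # Complete separation with three observations per group.
    print(exact_randomization_p_value([10, 11, 12], [1, 2, 3]))
```

With three observations per group there are only C(6, 3) = 20 relabelings, so even completely separated groups give a two-sided P-value of 2/20 = 0.1, and the 0.05 level cannot be reached.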

9.
ABSTRACT

This article discusses two asymmetrization methods, Azzalini's representation and beta generation, to generate asymmetric bimodal models including two novel beta-generated models. The practical utility of these models is assessed with nine data sets from different fields of applied sciences. Besides this tutorial assessment, some methodological contributions are made: a random number generator for the asymmetric Rathie–Swamee model is developed (generators for the other models are already known and briefly described) and a new likelihood ratio test of unimodality is compared via simulations with other available tests. Several tools have been used to quantify and test for bimodality and assess goodness of fit including Bayesian information criterion, measures of agreement with the empirical distribution and the Kolmogorov–Smirnov test. In the nine case studies, the results favoured models derived from Azzalini's asymmetrization, but no single model provided a best fit across the applications considered. In only two cases was the normal mixture selected as the best model. Parameter estimation has been done by likelihood maximization. Numerical optimization must be performed with care since local optima are often present. We concluded that the models considered are flexible enough to fit different bimodal shapes and that the tools studied should be used with care and attention to detail.

10.
Massively Parallel Signature Sequencing (MPSS) is a high-throughput counting-based technology available for gene expression profiling. It produces output that is similar to Serial Analysis of Gene Expression (SAGE) and is ideal for building complex relational databases for gene expression. Our goal is to compare the in vivo global gene expression profiles of tissues infected with different strains of Salmonella obtained using the MPSS technology. In this article, we develop an exact ANOVA type model for this count data using a zero-inflated Poisson (ZIP) distribution, different from existing methods that assume continuous densities. We adopt two Bayesian hierarchical models, one parametric and the other semiparametric with a Dirichlet process prior that has the ability to "borrow strength" across related signatures, where a signature is a specific arrangement of the nucleotides, usually 16-21 base-pairs long. We utilize the discreteness of the Dirichlet process prior to cluster signatures that exhibit similar differential expression profiles. Tests for differential expression are carried out using non-parametric approaches, while controlling the false discovery rate. We identify several differentially expressed genes that have important biological significance and conclude with a summary of the biological discoveries.
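For readers unfamiliar with the zero-inflated Poisson distribution named in this abstract, the sketch below writes out its probability mass function in the standard textbook form: a point mass at zero with weight ω mixed with a Poisson(λ) count. It does not reproduce the article's ANOVA-type or hierarchical models; the function name and the parameter names `lam` and `omega` are illustrative assumptions.

```python
from math import exp, lgamma, log


def zip_pmf(y, lam, omega):
    """P(Y = y) for a zero-inflated Poisson count.

    With probability omega the count is a structural zero; otherwise it
    is drawn from Poisson(lam). Hence
        P(Y = 0) = omega + (1 - omega) * exp(-lam)
        P(Y = y) = (1 - omega) * exp(-lam) * lam**y / y!   for y >= 1
    """
    if y < 0 or int(y) != y:
        raise ValueError("y must be a non-negative integer")
    if lam <= 0:
        raise ValueError("lam must be positive")
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    # Poisson term computed on the log scale to avoid overflow for large y.
    poisson = exp(-lam + y * log(lam) - lgamma(y + 1))
    structural_zero = omega if y == 0 else 0.0
    return structural_zero + (1.0 - omega) * poisson


if __name__ == "__main__":
    # Extra mass at zero compared with a plain Poisson(2) count.
    print(zip_pmf(0, lam=2.0, omega=0.3), zip_pmf(0, lam=2.0, omega=0.0))
```

Setting `omega=0.0` recovers the ordinary Poisson distribution, which makes the role of the extra zero mass easy to check.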

11.
Approximating the distribution of mobile communications expenditures (MCE) is complicated by zero observations in the sample. To deal with the zero observations by allowing a point mass at zero, a mixture model of MCE distributions is proposed and applied. The MCE distribution is specified as a mixture of two distributions, one with a point mass at zero and the other with full support on the positive half of the real line. The model is empirically verified for individual MCE survey data collected in Seoul, Korea. The mixture model can easily capture the common bimodality feature of the MCE distribution. In addition, when covariates were added to the model, it was found that the probability that an individual has non-expenditure significantly varies with some variables. Finally, the goodness-of-fit test suggests that the data are well represented by the mixture model.

12.
13.
Bimodal truncated count distributions are frequently observed in aggregate survey data and in user ratings when respondents are mixed in their opinion. They also arise in censored count data, where the highest category might create an additional mode. Modeling bimodal behavior in discrete data is useful for various purposes, from comparing shapes of different samples (or survey questions) to predicting future ratings by new raters. The Poisson distribution is the most common distribution for fitting count data and can be modified to achieve mixtures of truncated Poisson distributions. However, it is suitable only for modeling equidispersed distributions and is limited in its ability to capture bimodality. The Conway–Maxwell–Poisson (CMP) distribution is a two-parameter generalization of the Poisson distribution that allows for over- and underdispersion. In this work, we propose a mixture of CMPs for capturing a wide range of truncated discrete data, which can exhibit unimodal and bimodal behavior. We present methods for estimating the parameters of a mixture of two CMP distributions using an EM approach. Our approach introduces a special two-step optimization within the M step to estimate multiple parameters. We examine computational and theoretical issues. The methods are illustrated for modeling ordered rating data as well as truncated count data, using simulated and real examples.
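As a rough illustration of the distribution family discussed in this abstract, the sketch below evaluates a truncated CMP probability mass function, P(Y = y) ∝ λ^y / (y!)^ν on a finite support, and a two-component mixture of such distributions. It covers only the density, not the EM estimation with the two-step M step described above; the function names, the support bounds `lo` and `hi`, and the example parameter values are illustrative assumptions.

```python
from math import exp, lgamma, log


def truncated_cmp_pmf(lam, nu, lo, hi):
    """PMF of a CMP(lam, nu) count restricted to the integers lo..hi.

    Unnormalized weights are lam**y / (y!)**nu; nu = 1 gives a truncated
    Poisson, nu < 1 overdispersion and nu > 1 underdispersion.
    Returns a list whose entry j is P(Y = lo + j).
    """
    if lam <= 0 or nu < 0 or lo < 0 or hi < lo:
        raise ValueError("need lam > 0, nu >= 0 and 0 <= lo <= hi")
    log_w = [y * log(lam) - nu * lgamma(y + 1) for y in range(lo, hi + 1)]
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def cmp_mixture_pmf(weight, lam1, nu1, lam2, nu2, lo, hi):
    """Mixture weight * CMP(lam1, nu1) + (1 - weight) * CMP(lam2, nu2) on lo..hi."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    first = truncated_cmp_pmf(lam1, nu1, lo, hi)
    second = truncated_cmp_pmf(lam2, nu2, lo, hi)
    return [weight * a + (1.0 - weight) * b for a, b in zip(first, second)]


if __name__ == "__main__":
    # One component piles up at 0, the other at the truncation point 10.
    for y, p in enumerate(cmp_mixture_pmf(0.5, 0.3, 1.0, 40.0, 1.5, 0, 10)):
        print(y, round(p, 4))
```

With these example values the mixture has one mode at the lowest category and another at the highest, the kind of shape the abstract associates with polarized ratings and censored counts.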

14.
Abstract. DNA array technology is an important tool for genomic research because of its capacity to measure simultaneously the expression levels of a great number of genes or fragments of genes in different experimental conditions. An important point in gene expression data analysis is to identify clusters of genes which present similar expression levels. We propose a new procedure for estimating the mixture model for clustering of gene expression data. The proposed method is a posterior split-merge-birth MCMC procedure which does not require the specification of the number of components, since it is estimated jointly with component parameters. The strategy for splitting is based on data and on posterior distribution from the previously allocated observations. This procedure defines a quick split proposal, in contrast to other split procedures, which require substantial computational effort. The performance of the method is verified using real and simulated datasets.

15.
Summary.  Advances in understanding the biological underpinnings of many cancers have led increasingly to the use of molecularly targeted anticancer therapies. Because the platelet-derived growth factor receptor (PDGFR) has been implicated in the progression of prostate cancer bone metastases, it is of great interest to examine possible relationships between PDGFR inhibition and therapeutic outcomes. We analyse the association between change in activated PDGFR (phosphorylated PDGFR) and progression-free survival time based on large within-patient samples of cell-specific phosphorylated PDGFR values taken before and after treatment from each of 88 prostate cancer patients. To utilize these paired samples as covariate data in a regression model for progression-free survival time, and because the phosphorylated PDGFR distributions are bimodal, we first employ a Bayesian hierarchical mixture model to obtain a deconvolution of the pretreatment and post-treatment within-patient phosphorylated PDGFR distributions. We evaluate fits of the mixture model and a non-mixture model that ignores the bimodality by using a supnorm metric to compare the empirical distribution of each phosphorylated PDGFR data set with the corresponding fitted distribution under each model. Our results show that first using the mixture model to account for the bimodality of the within-patient phosphorylated PDGFR distributions, and then using the posterior within-patient component mean changes in phosphorylated PDGFR so obtained as covariates in the regression model for progression-free survival time, provides an improved estimation.

16.
Summary.  As biological knowledge accumulates rapidly, gene networks encoding genomewide gene–gene interactions have been constructed. As an improvement over the standard mixture model, which treats all genes as independently and identically distributed a priori, Wei and co-workers have proposed modelling a gene network as a discrete or Gaussian Markov random field (MRF) in a mixture model to analyse genomic data. However, how these methods compare in practical applications is not well understood; comparing them is the aim here. We also propose two novel constraints in prior specifications for the Gaussian MRF model and a fully Bayesian approach to the discrete MRF model. We assess the accuracy of estimating the false discovery rate by posterior probabilities in the context of MRF models. Applications to a chromatin immuno-precipitation–chip data set and simulated data show that the modified Gaussian MRF models have superior performance compared with other models, and both MRF-based mixture models, with reasonable robustness to misspecified gene networks, outperform the standard mixture model.

17.
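In joint generalized linear models, both the dispersion parameter and the mean are given a generalized linear model structure. This paper considers variable selection for the mean component of joint generalized linear models when only the first and second moments of the distribution are specified. Using the generalized quasi-likelihood function, we propose a new model selection criterion (EAIC), which is an extension of the Akaike information criterion. A simulation study verifies the performance of the criterion.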

18.
This article deals with some important computational aspects of the generalized von Mises distribution in relation to parameter estimation, model selection and simulation. The generalized von Mises distribution provides a flexible model for circular data allowing for symmetry, asymmetry, unimodality and bimodality. For this model, we show the equivalence between the trigonometric method of moments and the maximum likelihood estimators, we give their asymptotic distribution, we provide bias-corrected estimators of the entropy, the Akaike information criterion and the measured entropy for model selection, and we implement the ratio-of-uniforms method of simulation.

19.
Microarray experiments are widely used in medical and biological research. The main features of these studies are the large number of variables (genes) involved and the low number of replicates (arrays). It seems clear that the most appropriate models for detecting differences in gene expression are those that exploit the most useful information to compensate for the lack of replicates. On the other hand, controlling the error in the decision process plays an important role because of the high number of simultaneous statistical tests (one for each gene), so that concepts such as the false discovery rate (FDR) take on special importance. One of the alternatives for the analysis of the data in these experiments is based on the calculation of statistics derived from modifications of the classical methods used in this type of problem (moderated-t, B-statistic). Nonparametric techniques have also been proposed [B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher, Empirical Bayes analysis of a microarray experiment, J. Amer. Stat. Assoc. 96 (2001), pp. 1151–1160; W. Pan, J. Lin, and C.T. Le, A mixture model approach to detecting differentially expressed genes with microarray data, Funct. Integr. Genomics 3 (2003), pp. 117–124], allowing the analysis without assuming any prior condition about the distribution of the data, which makes them especially suitable in such situations. This paper presents a new method to detect differentially expressed genes based on non-parametric density estimation by a class of functions that allow us to define a distance between individuals in the sample (characterized by the coordinates of the individual (gene) in the dual space tangent to the manifold of parameters) [A. Miñarro and J.M. Oller, Some remarks on the individuals-score distance and its applications to statistical inference, Qüestiió, 16 (1992), pp. 43–57]. From these distances, we designed the test to determine the rejection region based on the control of FDR.
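Since this abstract, like several others in the list, relies on controlling the false discovery rate across one test per gene, a short sketch of a standard FDR procedure may help. The code below implements the Benjamini–Hochberg step-up rule, which is an assumption chosen for illustration; the abstract does not say which FDR-controlling rule the authors use, and the function name and the level `q` are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.

    Sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject the k smallest p-values.
    Returns a list of booleans in the original order.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie in (0, 1)")
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            largest_rank = rank  # keep scanning: a later rank may also pass
    rejected = [False] * m
    for i in order[:largest_rank]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, q=0.05))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value passes a later threshold, which is why the loop scans all ranks instead of stopping at the first failure.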

20.
A mixture of the MANOVA and GMANOVA models is presented. The expected value of the response matrix in this model is the sum of two matrix components. The first component represents the GMANOVA portion and the second component represents the MANOVA portion. Maximum likelihood estimators are derived for the parameters in this model, and goodness-of-fit tests are constructed for fuller models via the likelihood ratio criterion. Finally, likelihood ratio tests for general linear hypotheses are developed and a numerical example is presented.
