Similar literature
Found 20 similar articles (search time: 171 ms)
1.
Summary. In microarray experiments, accurate estimation of the gene variance is a key step in the identification of differentially expressed genes. Variance models range from the overly stringent homoscedastic assumption to the overparameterized model that assumes a specific variance for each gene. Between these two extremes there is room for intermediate models. We propose a method that identifies clusters of genes with equal variance, using a mixture model on the gene variance distribution. A test statistic for ranking and detecting differentially expressed genes is proposed. The method is illustrated with publicly available complementary deoxyribonucleic acid microarray experiments, an unpublished data set and further simulation studies.

2.
Summary. An advantage of randomization tests for small samples is that an exact P-value can be computed under an additive model. A disadvantage with very small sample sizes is that the resulting discrete distribution for P-values can make it mathematically impossible for a P-value to attain a particular degree of significance. We investigate a distribution of P-values that arises when several thousand randomization tests are conducted simultaneously using small samples, a situation that arises with microarray gene expression data. We show that the distribution yields valuable information regarding groups of genes that are differentially expressed between two groups: a treatment group and a control group. This distribution helps to categorize genes with varying degrees of overlap of genetic expression values between the two groups, and it helps to quantify the degree of overlap by using the P-value from a randomization test. Moreover, a statistical test is available that compares the actual distribution of P-values with an expected distribution if there are no genes that are differentially expressed. We demonstrate the method and illustrate the results by using a microarray data set involving a cell line for rheumatoid arthritis. A small simulation study evaluates the effect that correlated gene expression levels could have on results from the analysis.

3.
An account of the behavior of the independent-samples t-test when applied to homoscedastic bivariate normal data is presented, and a comparison is made with the paired-samples t-test. Since the significance level is not violated when applying the independent-samples t-test to data which consist of positively correlated pairs and since the estimate of the variance is based on a larger number of ‘degrees of freedom’, the results suggest that when the sample size is small, one should not worry much about the possible existence of weak positive correlation. One may do better, powerwise, to ignore such correlation and use the independent-samples t-test, as though the samples were independent.

4.
Early investigations of the effects of non-normality indicated that skewness has a greater effect on the distribution of the t-statistic than does kurtosis. When the distribution is skewed, the actual p-values can be larger than the values calculated from the t-tables. Transformation of data to normality has shown good results in the case of the univariate t-test. In order to reduce the effect of skewness of the distribution on the normal-based t-test, one can transform the data and perform the t-test on the transformed scale. This method is not only a remedy for satisfying the distributional assumption, but it also turns out that one can achieve greater efficiency of the test. We investigate the efficiency of tests after a Box-Cox transformation. In particular, we consider the one-sample test of location and study the gains in efficiency for the one-sample t-test following a Box-Cox transformation. Under some conditions, we prove that the asymptotic relative efficiency of the transformed t-test and Hotelling's T²-test of multivariate location with respect to the same statistic based on untransformed data is at least one.
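
The idea of transforming first and testing on the transformed scale can be sketched as follows. This is a minimal illustration and not the paper's analysis: the sample values, the fixed choice `lam = 0` (the log transform) and the null value `mu0` are assumptions, and in practice the Box-Cox parameter would usually be estimated from the data. Note that the null value has to be carried to the transformed scale as well.

```python
from math import log, sqrt
from statistics import mean, stdev


def box_cox(x, lam):
    """Box-Cox power transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    return log(x) if lam == 0 else (x ** lam - 1) / lam


def one_sample_t(values, mu0):
    """Classical one-sample t-statistic for H0: location = mu0."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / sqrt(n))


# Hypothetical right-skewed sample; lam = 0 is an assumed, not estimated, choice.
sample = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 9.5]
lam, mu0 = 0.0, 2.0
t_raw = one_sample_t(sample, mu0)
# Test on the transformed scale: transform the data and the null value alike.
t_bc = one_sample_t([box_cox(v, lam) for v in sample], box_cox(mu0, lam))
print(t_raw, t_bc)
```

The two statistics address slightly different hypotheses (the mean on the raw scale versus the location on the transformed scale), which is why the comparison in the abstract is stated in terms of efficiency rather than identical P-values.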

5.
Preliminary testing procedures for the two means problem traditionally employ the pooled variance t-statistic. In this paper we show that bias of the t-statistic under conditions of heterogeneity of variance may be increased if use of the t-statistic is conditional on an affirmative F-test. For this reason we conclude that use of the t-statistic in preliminary testing procedures is inappropriate.

6.
Fan J, Feng Y, Niu YS. Annals of Statistics, 2010, 38(5): 2723-2750
Estimation of genewise variance arises from two important applications in microarray data analysis: selecting significantly differentially expressed genes and validation tests for normalization of microarray data. We approach the problem by introducing a two-way nonparametric model, which is an extension of the famous Neyman-Scott model and is applicable beyond microarray data. The problem itself poses interesting challenges because the number of nuisance parameters is proportional to the sample size and it is not obvious how the variance function can be estimated when measurements are correlated. In such a high-dimensional nonparametric problem, we propose two novel nonparametric estimators for the genewise variance function and semiparametric estimators for measurement correlation, via solving a system of nonlinear equations. Their asymptotic normality is established. The finite sample property is demonstrated by simulation studies. The estimators also improve the power of the tests for detecting statistically differentially expressed genes. The methodology is illustrated by the data from the MicroArray Quality Control (MAQC) project.

7.
Microarray studies are now common for human, agricultural plant and animal studies. False discovery rate (FDR) is widely used in the analysis of large-scale microarray data to account for problems associated with multiple testing. A well-designed microarray study should have adequate statistical power to detect the differentially expressed (DE) genes, while keeping the FDR acceptably low. In this paper, we use a mixture model of expression responses involving DE genes and non-DE genes to analyse theoretical FDR and power for simple scenarios where it is assumed that each gene has equal error variance and the gene effects are independent. A simulation study is used to evaluate the empirical FDR and power for more complex scenarios with unequal error variance and gene dependence. Based on this approach, we present a general guide for sample size requirement at the experimental design stage for prospective microarray studies. The approach explicitly connects the sample size with FDR and power. While the methods have been developed in the context of one-sample microarray studies, they are readily applicable to two-sample studies, and could be adapted to multiple-sample studies.

8.
Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. In this paper, we propose a flexible rank-based nonparametric procedure for gene selection from microarray data. In the method we propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing a different variance for each gene. The contribution to this “single gene” statistic is the studentization of the empirical AUC, which takes into account the variances associated with each gene in the experiment. DeLong et al. proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to get a test statistic, and we focus on the primary step in the gene selection process, namely, the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the methods and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. The work includes how to use the variance information to produce a list of significant targets and assess differential gene expressions under two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed methods offer useful analytical tools for identifying differentially expressed genes for further biological and clinical analysis.
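
A studentized AUC of the kind described here can be sketched as below. This is an illustrative reading of the abstract, not the authors' code: it uses the placement-value form of the DeLong variance estimator, and the function names and the toy expression values are assumptions.

```python
from math import sqrt
from statistics import variance


def psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 for a tie, 0 otherwise."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def studentized_auc(cases, controls):
    """Return (AUC, z) with z = (AUC - 0.5) / sqrt(DeLong variance)."""
    m, n = len(cases), len(controls)
    # Placement values: how each observation ranks against the other group.
    v10 = [sum(psi(x, y) for y in controls) / n for x in cases]
    v01 = [sum(psi(x, y) for x in cases) / m for y in controls]
    auc = sum(v10) / m
    var = variance(v10) / m + variance(v01) / n
    if var == 0:
        raise ValueError("variance estimate is zero (e.g. perfect separation)")
    return auc, (auc - 0.5) / sqrt(var)


# Hypothetical expression values of one gene under two conditions.
auc, z = studentized_auc([3, 5, 6], [1, 2, 4])
print(auc, z)
```

Genes could then be ranked by the absolute value of `z`; the guard for a zero variance estimate matters with very small samples, where perfectly separated groups are common.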

9.
In this paper, we study the multi-class differential gene expression detection for microarray data. We propose a likelihood-based approach to estimating an empirical null distribution to incorporate gene interactions and provide a more accurate false-positive control than the commonly used permutation or theoretical null distribution-based approach. We propose to rank important genes by p-values or local false discovery rate based on the estimated empirical null distribution. Through simulations and application to lung transplant microarray data, we illustrate the competitive performance of the proposed method.

10.
Without the exchangeability assumption, permutation tests for comparing two population means do not provide exact control of the probability of making a Type I error. Another drawback of permutation tests is that they cannot be used to test hypotheses about one population. In this paper, we propose a new type of permutation test for testing the difference between two population means: the split sample permutation t-tests. We show that the split sample permutation t-tests do not require the exchangeability assumption, are asymptotically exact and can be easily extended to testing hypotheses about one population. Extensive simulations were carried out to evaluate the performance of two specific split sample permutation t-tests: the split in the middle permutation t-test and the split in the end permutation t-test. The simulation results show that the split in the middle permutation t-test has comparable performance to the permutation test if the population distributions are symmetric and satisfy the exchangeability assumption. Otherwise, the split in the end permutation t-test has significantly more accurate control of the level of significance than the split in the middle permutation t-test and other existing permutation tests.

11.
Microarray experiments are being widely used in medical and biological research. The main features of these studies are the large number of variables (genes) involved and the low number of replicates (arrays). It seems clear that the most appropriate models for detecting differences in gene expression are those that exploit the most useful information to compensate for the lack of replicates. On the other hand, control of the error in the decision process plays an important role because of the high number of simultaneous statistical tests (one for each gene), so that concepts such as the false discovery rate (FDR) take on special importance. One of the alternatives for the analysis of the data in these experiments is based on the calculation of statistics derived from modifications of the classical methods used in this type of problem (moderated-t, B-statistic). Nonparametric techniques have also been proposed [B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher, Empirical Bayes analysis of a microarray experiment, J. Amer. Stat. Assoc. 96 (2001), pp. 1151–1160; W. Pan, J. Lin, and C.T. Le, A mixture model approach to detecting differentially expressed genes with microarray data, Funct. Integr. Genomics 3 (2003), pp. 117–124], allowing the analysis without assuming any prior condition about the distribution of the data, which makes them especially suitable in such situations. This paper presents a new method to detect differentially expressed genes based on non-parametric density estimation by a class of functions that allow us to define a distance between individuals in the sample (characterized by the coordinates of the individual (gene) in the dual space tangent to the manifold of parameters) [A. Miñarro and J.M. Oller, Some remarks on the individuals-score distance and its applications to statistical inference, Qüestiió, 16 (1992), pp. 43–57]. From these distances, we designed the test to determine the rejection region based on the control of FDR.

12.
By a family of designs we mean a set of designs whose parameters can be represented as functions of an auxiliary variable t where the design will exist for infinitely many values of t. The best known family is probably the family of finite projective planes with v = b = t² + t + 1, r = k = t + 1, and λ = 1. In some instances, notably coding theory, the existence of families is essential to provide the degree of precision required which can well vary from one coding problem to another. A natural vehicle for developing binary codes is the class of Hadamard matrices. Bush (1977) introduced the idea of constructing semi-regular designs using Hadamard matrices whereas the present study is concerned mostly with construction of regular designs using Hadamard matrices. While codes constructed from these designs are not optimal in the usual sense, it is possible that they may still have substantial value since, with different values of λ₁ and λ₂, there are different error correcting capabilities.
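
The projective-plane parameters quoted in this abstract can be checked numerically. The sketch below is only an illustration and is not taken from the paper: it verifies the standard counting identities bk = vr and λ(v − 1) = r(k − 1) for the family, and builds the smallest member (t = 2, the Fano plane) from the difference set {0, 1, 3} modulo 7. The function names are assumptions.

```python
from itertools import combinations


def plane_parameters(t):
    """Parameters (v, b, r, k, lam) of the projective-plane family."""
    v = t * t + t + 1
    return v, v, t + 1, t + 1, 1


def satisfies_bibd_identities(v, b, r, k, lam):
    """Necessary counting conditions for a balanced incomplete block design."""
    return b * k == v * r and lam * (v - 1) == r * (k - 1)


# t = 2: the Fano plane, built by developing the difference set {0, 1, 3} mod 7.
fano = [frozenset((i + d) % 7 for d in (0, 1, 3)) for i in range(7)]
# Count, for every pair of points, how many blocks contain both points.
pair_counts = {
    pair: sum(1 for block in fano if set(pair) <= block)
    for pair in combinations(range(7), 2)
}
print(plane_parameters(2), set(pair_counts.values()))
```

The identities hold for every t, but they are necessary rather than sufficient: they say nothing about whether a plane of a given order actually exists, which is why the abstract speaks of existence for infinitely many values of t.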

13.
In large-scale data analysis, for example of microarray data, testing for equality of means in order to discover differentially expressed genes often involves a large number of features but only a small number of replicates. Furthermore, some genes are differentially expressed and others are not. The usual permutation method applied in these situations therefore estimates the p-value poorly, because the two types of genes are mixed. To overcome this obstacle, null permutation samples have been suggested in the literature. We propose a modified uniformly most powerful unbiased test for testing the null hypothesis.

14.
The family of t-designs is, without any doubt, the most important family of statistical designs. Their importance is due to their statistical optimalities, desirable symmetries for analyses and interpretations, and uses for constructing other important designs and structures such as Youden designs, generalized Youden designs, optimal fractional factorial designs, error detecting and correcting binary codes, balanced arrays, combinatorial filing systems, Hadamard matrices, finite projective and affine planes, strongly regular graphs, and so on. Research in the area of t-designs has been steadily and rapidly growing, especially during the last three decades. The number of publications in this area is in the several hundreds. Since papers on t-designs are published in a variety of journals, and because of the extensive role of these designs in design of experiments and other areas, we believe it is imperative to gather these results and present them in varied form to suit diverse interests. This paper is an instance of such an attempt.

15.
Traditionally, grain yield of rice is assessed by splitting it into the yield components, including the number of panicles per plant, the number of spikelets per panicle, the 1000-grain weight and the filled-spikelet percentage, such that the yield performance can be individually evaluated through each component, and the products of yield components are employed for grain yield comparisons. However, when using standard statistical methods, such as the two-sample t-test and analysis of variance, the assumptions of normality and variance homogeneity cannot be fully justified for comparing the grain yields, so the empirical sizes cannot be adequately controlled. In this study, based on the concepts of generalized test variables and generalized p-values, a novel statistical testing procedure is developed for grain yield comparisons of rice. The proposed method is assessed by a series of numerical simulations. According to the simulation results, the proposed method performs reasonably well in Type I error control and empirical power. In addition, a real-life field experiment is analyzed by the proposed method, and some productive rice varieties are screened out and suggested for a follow-up investigation.

16.
In this paper, we provide a unified framework for two-sample t-test with partially paired data. We show that many existing two-sample t-tests with partially paired data can be viewed as special members in our unified framework. Some shortcomings of these t-tests are discussed. We also propose the asymptotically optimal weighted linear combination of the test statistics comparing all four paired and unpaired data sets. Simulation studies are used to illustrate the performance of our proposed asymptotically optimal weighted combinations of test statistics and compare with some existing methods. It is found that our proposed test statistic is generally more powerful. Three real data sets about CD4 count, DNA extraction concentrations, and the quality of sleep are also analyzed by using our newly introduced test statistic.

17.
A number of procedures for testing adequacy of polynomial approximations to growth curves based on Rao’s test for additional information, Grizzle and Allen’s test or univariate t-tests were compared using data simulated from quadratic models. Quadratic models were indicated as adequately fitting the data in 95.10 ± 0.10 percent of analyses when the degree of the approximating polynomial was determined by the lowest-order significant coefficient (P = 0.05) that was followed by two successive nonsignificant ones according to separate t-tests. Procedures based on the Grizzle and Allen test and modifications of it indicated quadratic models in 86.70 ± 0.41 to 95.34 ± 0.21 percent of analyses depending on error structure, variance and the number of coefficients analysed together. The t-test would be preferred in practice as its performance did not depend on the error structure or variance.

18.
Distributions of a response y (height, for example) differ with values of a factor t (such as age). Given a response y* for a subject of unknown t*, the objective of inverse prediction is to infer the value of t* and to provide a defensible confidence set for it. Training data provide values of y observed on subjects at known values of t. Models relating the mean and variance of y to t can be formulated as mixed (fixed and random) models in terms of sets of functions of t, such as polynomial spline functions. A confidence set on t* can then be had as those hypothetical values of t for which y* is not detected as an outlier when compared to the model fit to the training data. With nonconstant variance, the p-values for these tests are approximate. This article describes how versatile models for this problem can be formulated in such a way that the computations can be accomplished with widely available software for mixed models, such as SAS PROC MIXED. Coverage probabilities of confidence sets on t* are illustrated in an example.

19.
Rao (1947) provided two inequalities on parameters of an orthogonal array OA(N,m,s,t). An orthogonal array attaining these Rao bounds is said to be complete. Noda (1979) characterized complete orthogonal arrays of t=4 (strength). We here investigate complete orthogonal arrays with s=2 (levels) and general t; and with t=2, 3 and general s.

20.
Many exploratory studies such as microarray experiments require the simultaneous comparison of hundreds or thousands of genes. In many microarray experiments, most genes are not expected to be differentially expressed. Under such a setting, a procedure that is designed to control the false discovery rate (FDR) is aimed at identifying as many potential differentially expressed genes as possible. The usual FDR controlling procedure is constructed based on the number of hypotheses. However, it can become very conservative when some of the alternative hypotheses are expected to be true. The power of a controlling procedure can be improved if the number of true null hypotheses (m₀) instead of the number of hypotheses is incorporated in the procedure [Y. Benjamini and Y. Hochberg, On the adaptive control of the false discovery rate in multiple testing with independent statistics, J. Edu. Behav. Statist. 25 (2000), pp. 60–83]. Nevertheless, m₀ is unknown, and has to be estimated. The objective of this article is to evaluate some existing estimators of m₀ and discuss the feasibility of incorporating these estimators into FDR controlling procedures under various experimental settings. The results of simulations can help the investigator to choose an appropriate procedure to meet the requirement of the study.
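
The effect of replacing the number of hypotheses by m₀ can be sketched with a small Benjamini-Hochberg step-up implementation. This is an illustration only, not the article's procedure: the function name `bh_rejections`, the P-values and the value `m0=5` are made up, and no estimator of m₀ is shown because the abstract does not name the ones it evaluates.

```python
def bh_rejections(p_values, q, m0=None):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q.

    If m0 (a count of true nulls, assumed to be supplied by the caller) is
    passed, it replaces the total number of hypotheses m in the thresholds,
    as in adaptive variants of the procedure.
    """
    m = len(p_values)
    denom = m if m0 is None else max(1, min(m0, m))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / denom:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Hypothetical P-values for 10 genes.
p = [0.001, 0.008, 0.020, 0.026, 0.045, 0.30, 0.45, 0.60, 0.75, 0.90]
plain = bh_rejections(p, q=0.05)
adaptive = bh_rejections(p, q=0.05, m0=5)  # m0 = 5 is an assumed value
print(plain, adaptive)
```

With these invented numbers the plain procedure rejects two hypotheses and the version using the assumed m₀ rejects five, which shows why the quality of the m₀ estimate matters: an estimate that is too small inflates the thresholds and can compromise FDR control.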
