Similar Articles
1.
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNPs) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). A simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in scope. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple-comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can easily be adapted to estimate the false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results across different genomic regions. The variation in adjustments along the genome is, however, well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online.
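As a rough illustration of the Poisson clumping idea behind such adjustments (not the authors' implementation), the sketch below treats exceedances of a per-SNP threshold as a Poisson number of clumps; the clump rate, written here as a hypothetical effective number of independent tests, is the quantity the paper estimates from the LD structure rather than assumes.

```python
import numpy as np

def genomewide_adjusted_p(p_local, clump_rate):
    """Poisson clumping heuristic: if exceedances of the per-SNP threshold
    p_local arrive as clumps with expected count clump_rate * p_local over
    the whole scan, the family-wise adjusted p-value is 1 - exp(-lambda)."""
    lam = clump_rate * p_local          # expected number of exceedance clumps
    return 1.0 - np.exp(-lam)

# Hypothetical clump rate (an "effective number of independent tests");
# the paper derives this quantity from local LD, so this value is only a stand-in.
m_eff = 1.0e6
for p in (5e-8, 1e-6, 1e-4):
    print(f"local p = {p:.0e} -> genome-wide adjusted p = "
          f"{genomewide_adjusted_p(p, m_eff):.4f}")
```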

2.
Affymetrix's SNP (single-nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene-mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of an SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. This article describes our attempt to improve genotype calling by constructing Gaussian mixture models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, our models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype-calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the Bayesian information criterion, we select the best combination of symmetry assumptions. Compared to Affymetrix's software, our model leads to a reduction in no-calls with little sacrifice in overall calling accuracy.
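A minimal sketch of the clustering step is given below, using scikit-learn's GaussianMixture rather than the authors' model with empirical priors and pedigree correlations; the "prior" cluster centres are hypothetical values standing in for information pooled across many SNPs and are used only to initialise and stabilise the per-SNP fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# x: one value per sample for one SNP, e.g. the contrast
# log2(intensity_A) - log2(intensity_B); three clusters ~ AA, AB, BB.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 0.2, 40),
                    rng.normal(0.0, 0.2, 50),
                    rng.normal(1.5, 0.2, 40)]).reshape(-1, 1)

# Hypothetical cluster centres pooled over many SNPs (illustrative values).
prior_means = np.array([[-1.5], [0.0], [1.5]])

gm = GaussianMixture(n_components=3, means_init=prior_means,
                     covariance_type="full").fit(x)
calls = gm.predict(x)                     # 0/1/2 -> AA/AB/BB
posteriors = gm.predict_proba(x)
no_call = posteriors.max(axis=1) < 0.95   # declare a no-call if the fit is uncertain
```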

3.
Case-control studies of genetic polymorphisms and gene-environment interactions are reporting large numbers of statistically significant associations, many of which are likely to be spurious. This problem reflects the low prior probability that any one null hypothesis is false, and the large number of test results reported for a given study. In a Bayesian approach to the low prior probabilities, Wacholder et al. (2004) suggest supplementing the p-value for a hypothesis with its posterior probability given the study data. In a frequentist approach to the test multiplicity problem, Benjamini & Hochberg (1995) propose a hypothesis-rejection rule that provides greater statistical power by controlling the false discovery rate rather than the family-wise error rate controlled by the Bonferroni correction. This paper defines a Bayes false discovery rate and proposes a Bayes-based rejection rule for controlling it. The method, which combines the Bayesian approach of Wacholder et al. with the frequentist approach of Benjamini & Hochberg, is used to evaluate the associations reported in a case-control study of breast cancer risk and genetic polymorphisms of genes involved in the repair of double-strand DNA breaks.
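The general flavour of such a Bayes false discovery rate rule can be sketched as follows; this is a generic posterior-probability analogue of the Benjamini & Hochberg step-up rule, not necessarily the exact rule defined in the paper.

```python
import numpy as np

def bayes_fdr_reject(post_null, q=0.05):
    """Reject the largest set of hypotheses whose average posterior
    probability of being null is at most q (a Bayesian analogue of
    Benjamini-Hochberg FDR control, shown purely for illustration)."""
    post_null = np.asarray(post_null)
    order = np.argsort(post_null)                       # most promising first
    running_mean = np.cumsum(post_null[order]) / np.arange(1, len(post_null) + 1)
    k = np.max(np.nonzero(running_mean <= q)[0], initial=-1) + 1
    reject = np.zeros(len(post_null), dtype=bool)
    reject[order[:k]] = True
    return reject

# Hypothetical posterior null probabilities for five polymorphisms.
print(bayes_fdr_reject([0.01, 0.02, 0.20, 0.60, 0.90], q=0.05))
```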

4.
5.
We study genotype calling algorithms for high-throughput single-nucleotide polymorphism (SNP) arrays. Building upon the novel SNP-robust multi-chip average preprocessing approach and the state-of-the-art corrected robust linear model with Mahalanobis distance (CRLMM) approach for genotype calling, we propose a simple modification to better model and combine the information across multiple SNPs with empirical Bayes modeling, which can often significantly improve the genotype calling of CRLMM. Through applications to the HapMap Trio data set and a non-HapMap test set of high-quality SNP chips, we illustrate the competitive performance of the proposed method.

6.
In biomedical research, profiling is now commonly conducted, generating high-dimensional genomic measurements (without loss of generality, say genes). An important analysis objective is to rank genes according to their marginal associations with a disease outcome/phenotype. Clinical covariates, for example clinical risk factors and environmental exposures, usually exist and need to be properly accounted for. In this study, we propose conducting marginal ranking of genes using a receiver operating characteristic (ROC) based method. This method can accommodate categorical, censored survival, and continuous outcome variables in a very similar manner. Unlike logistic-model-based methods, it does not make very specific model assumptions, making it robust. In ranking genes, we account for both the main effects of clinical covariates and their interactions with genes, and develop multiple measures of diagnostic accuracy improvement. Using simulation studies, we show that the proposed method is effective in that genes associated with the outcome, or whose interactions with covariates are associated with the outcome, receive high rankings. In data analysis, we observe some differences between the rankings from the proposed method and from the logistic-model-based method.
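A simplified sketch of ranking genes by the improvement in the area under the ROC curve (AUC) over a clinical-covariates-only baseline is shown below; a logistic working model is used here only to combine markers into a score, whereas the paper's method avoids such model assumptions, and all data and dimensions are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n, n_genes = 300, 50
clinical = rng.normal(size=(n, 2))                  # e.g. age, exposure
genes = rng.normal(size=(n, n_genes))
logit = 0.8 * clinical[:, 0] + 1.2 * genes[:, 3]    # gene 3 is truly associated
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

base = LogisticRegression(max_iter=1000).fit(clinical, y)
auc_base = roc_auc_score(y, base.predict_proba(clinical)[:, 1])

improvement = []
for g in range(n_genes):
    # Main effect of the gene plus its interaction with a clinical covariate.
    X = np.column_stack([clinical, genes[:, g], clinical[:, 0] * genes[:, g]])
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    improvement.append(roc_auc_score(y, fit.predict_proba(X)[:, 1]) - auc_base)

ranking = np.argsort(improvement)[::-1]             # genes ranked by AUC gain
```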

7.
A multistage variable selection method is introduced for detecting association signals in structured brain-wide and genome-wide association studies (brain-GWAS). Compared to conventional methods that link one voxel to one single nucleotide polymorphism (SNP), our approach is more efficient and powerful in selecting the important signals by integrating anatomic and gene grouping structures in the brain and the genome, respectively. It avoids resorting to a large number of multiple comparisons while effectively controlling the false discoveries. Validity of the proposed approach is demonstrated by both theoretical investigation and numerical simulations. We apply our proposed method to a brain-GWAS using Alzheimer's Disease Neuroimaging Initiative positron emission tomography (ADNI PET) imaging and genomic data. We confirm previously reported association signals and also uncover several novel SNPs and genes that are either associated with brain glucose metabolism or have their association significantly modified by Alzheimer's disease status.

8.
Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally "validated" in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN.

9.
Gene–gene interactions are often regarded as playing significant roles in influencing the variability of complex traits. Although much research has been devoted to this area, to date a comprehensive statistical model that addresses the various sources of uncertainty seems to be lacking. In this paper, we propose and develop a novel Bayesian semiparametric approach composed of finite mixtures based on Dirichlet processes and a hierarchical matrix-normal distribution that can comprehensively account for the unknown number of sub-populations and for gene–gene interactions. Then, by formulating novel and suitable Bayesian tests of hypotheses, we attempt to single out the roles of the genes, individually and in interaction with other genes, in case-control studies. We also attempt to identify the significant loci associated with the disease. Our model facilitates a highly efficient parallel computing methodology, combining Gibbs sampling and Transformation-based MCMC (TMCMC). Application of our ideas to biologically realistic data sets revealed quite encouraging performance. We also applied our ideas to a real myocardial infarction dataset and obtained interesting results that partly agree with, and also complement, existing work in this area, revealing the importance of sophisticated and realistic modeling of gene–gene interactions.

10.
Molecular markers combined with powerful statistical tools have made it possible to detect and analyze multiple loci on the genome that are responsible for the phenotypic variation in quantitative traits. The objectives of the study presented in this paper are to identify a subset of single nucleotide polymorphism (SNP) markers that are associated with a particular trait and to construct a model that can best predict the value of the trait given the genotypic information of the SNPs, using a three-step strategy. In the first step, a genome-wide association test is performed to screen SNPs that are associated with the quantitative trait of interest. SNPs with p-values of less than 5% are then analyzed in the second step. In the second step, a large number of randomly selected models, each consisting of a fixed number of randomly selected SNPs, are analyzed using the least angle regression method. This step further removes SNPs that are redundant due to the complicated association among SNPs. A subset of SNPs that are shown to have a significant effect on the response trait more often than by chance are considered for the third step. In the third step, two alternative methods are considered: the least absolute shrinkage and selection operator (LASSO) and sparse partial least squares regression. For both methods, the predictive ability of the fitted model is evaluated on an independent test set. The performance of the proposed method is illustrated by the analysis of a real data set on Canadian Holstein cattle.
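A minimal sketch of the three-step flow using scikit-learn is given below; the subset sizes, number of random models, and stability cut-off are illustrative choices rather than the paper's, evaluation on an independent test set is omitted, and the data are simulated.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Lars, Lasso

rng = np.random.default_rng(2)
n, m = 200, 1000
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)   # SNP genotypes coded 0/1/2
y = 0.7 * X[:, 10] + 0.5 * X[:, 200] + rng.normal(size=n)

# Step 1: genome-wide screen, keep SNPs with marginal p-value < 0.05.
_, pval = f_regression(X, y)
screened = np.flatnonzero(pval < 0.05)

# Step 2: fit least angle regression (LARS) on many random subsets of the
# screened SNPs and count how often each SNP enters the model, which trims
# SNPs that are redundant because of association (LD) among SNPs.
counts = np.zeros(m)
for _ in range(200):
    subset = rng.choice(screened, size=min(50, len(screened)), replace=False)
    fit = Lars(n_nonzero_coefs=10).fit(X[:, subset], y)
    counts[subset[np.abs(fit.coef_) > 0]] += 1
stable = np.flatnonzero(counts > np.quantile(counts[screened], 0.9))

# Step 3: final sparse predictive model (LASSO shown here) on the stable SNPs.
final = Lasso(alpha=0.05).fit(X[:, stable], y)
```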

11.
A Dynamic Economic Model for Optimal Management of Wildlife Resources and an Empirical Study
Studying how natural conditions, social development, and economic policy constrain the management of wildlife resources, and systematically analyzing the dynamic economic equilibrium of wildlife resource management, provide important theoretical support and practical guidance for wildlife conservation and management. On the natural-conditions side, a logistic growth model for an open biological population is used as the constraint; on the social-development side, the effect of industrial capital investment is considered; on the economic-policy side, taxes or subsidies serve as constraints. Cost-benefit analysis is used to build a dynamic economic equilibrium model for the sustainable use of wildlife resources, which is solved with ordinary differential equations and the maximum principle to obtain the optimal resource stock level and the optimal harvest. An applied empirical analysis using musk deer as an example gives an optimal population level of 1.4301 million animals and an optimal sustainable harvest of 0.6624 million animals. As the discount rate varies from 0.01 to 0.1, the optimal population level ranges from 1.4993 million to 1.3436 million animals, while the optimal sustainable yield stays between 0.6644 million and 0.6559 million animals.
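For orientation, a standard Clark-style bioeconomic formulation with the same ingredients (logistic growth, net revenue from harvesting, discounting) can be written as below; the paper's exact specification, including the tax or subsidy term, may differ.

```latex
% Generic Clark-style optimal-control problem (illustrative):
%   x = wildlife stock, h = harvest rate, p = price,
%   c(x) = unit harvest cost, delta = discount rate.
\max_{h(t)\ge 0} \int_0^{\infty} e^{-\delta t}\,\bigl[\,p - c\bigl(x(t)\bigr)\,\bigr]\,h(t)\,dt
\quad \text{s.t.} \quad
\dot x(t) = F\bigl(x(t)\bigr) - h(t), \qquad F(x) = r\,x\Bigl(1 - \frac{x}{K}\Bigr).

% The maximum principle gives the optimal steady-state stock x* from the
% "modified golden rule", with optimal sustained harvest h* = F(x*):
F'(x^{*}) - \frac{c'(x^{*})\,F(x^{*})}{p - c(x^{*})} = \delta , \qquad h^{*} = F(x^{*}).
```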

12.
Case-control data are often used in medical-related applications, and most studies have applied parametric logistic regression to analyze such data. In this study, we investigated a semiparametric model for the analysis of case-control data that relaxes the linearity assumption on risk factors by using a partial smoothing spline model. A faster computation method for the model, extending the lower-dimensional approximation approach of Gu and Kim (2002) developed for penalized likelihood regression, is considered for application to case-control studies. Simulations were conducted to evaluate the performance of the method with selected smoothing parameters and to compare the method with existing methods. The method was applied to Korean gastric cancer case-control data to estimate the nonparametric probability function of age and the regression parameters for other categorical risk factors simultaneously. The method could be used in preliminary studies to identify whether a flexible functional form of the risk factors is needed in a semiparametric logistic regression analysis involving a large data set.
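A rough stand-in for such a semiparametric fit is sketched below: the penalized smoothing spline of the paper is replaced by a fixed B-spline basis for age within an ordinary logistic GLM, and the data and variable names are simulated placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"age": rng.uniform(30, 80, n),
                   "smoker": rng.binomial(1, 0.4, n)})
# A nonlinear (U-shaped) age effect plus a parametric categorical effect.
risk = -2.0 + 0.0025 * (df["age"] - 55) ** 2 + 0.8 * df["smoker"]
df["case"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-risk)))

# Smooth term for age (B-spline basis) alongside a parametric smoking term.
fit = smf.glm("case ~ bs(age, df=5) + C(smoker)", data=df,
              family=sm.families.Binomial()).fit()
print(fit.summary())
```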

13.
Nested case-control and case-cohort studies are useful for studying associations between covariates and time-to-event when some covariates are expensive to measure. Full covariate information is collected in the nested case-control or case-cohort sample only, while cheaply measured covariates are often observed for the full cohort. Standard analysis of such case-control samples ignores any full cohort data. Previous work has shown how data for the full cohort can be used efficiently by multiple imputation of the expensive covariate(s), followed by a full-cohort analysis. For large cohorts this is computationally expensive or even infeasible. An alternative is to supplement the case-control samples with additional controls on which cheaply measured covariates are observed. We show how multiple imputation can be used for analysis of such supersampled data. Simulations show that this brings efficiency gains relative to a traditional analysis and that the efficiency loss relative to using the full cohort data is not substantial.

14.
Recently, many researchers have devoted themselves to investigating the number of replicates needed for experiments in blocks of size two. In practice, experiments in blocks of size four might be more useful than those in blocks of size two. To estimate the main effects and two-factor interactions from a two-level factorial experiment in blocks, we might need many replicates. This article investigates designs with the least number of replicates for factorial experiments in blocks of size four. Methods to obtain such designs are presented.

15.
Clustered longitudinal data feature cross-sectional associations within clusters, serial dependence within subjects, and associations between responses at different time points from different subjects within the same cluster. Generalized estimating equations are often used for inference with data of this sort since they do not require full specification of the response model. When data are incomplete, however, they require data to be missing completely at random unless inverse probability weights are introduced based on a model for the missing data process. The authors propose a robust approach for incomplete clustered longitudinal data using composite likelihood. Specifically, pairwise likelihood methods are described for conducting robust estimation with minimal model assumptions. The authors also show that the resulting estimates remain valid for a wide variety of missing data problems, including missing-at-random mechanisms, so in such cases there is no need to model the missing data process. In addition to describing the asymptotic properties of the resulting estimators, it is shown that the method performs well empirically through simulation studies for complete and incomplete data. Pairwise likelihood estimators are also compared with estimators obtained from inverse probability weighted alternating logistic regression. An application to data from the Waterloo Smoking Prevention Project is provided for illustration.

16.
This paper develops a method for handling two-class classification problems with highly unbalanced class sizes and misclassification costs. When the class sizes are highly unbalanced and the minority class represents a rare event, conventional classification methods tend to strongly favour the majority class, resulting in very low detection of the minority class. A method is proposed to determine the optimal cut-off for asymmetric misclassification costs and for unbalanced class sizes. Monte Carlo simulations show that this proposal performs better than the method based on the notion of classification accuracy. Finally, the proposed method is applied to empirical data on Italian small and medium enterprises to classify them into default and non-default groups.
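A minimal sketch of choosing a cost-optimal cut-off on classifier scores is shown below; the cost values and the use of the observed prevalence as the class prior are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def optimal_cutoff(scores, labels, cost_fn, cost_fp, prior_pos=None):
    """Choose the score threshold that minimises expected misclassification
    cost. cost_fn = cost of missing a (rare) positive, cost_fp = cost of a
    false alarm. prior_pos defaults to the observed prevalence."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    if prior_pos is None:
        prior_pos = labels.mean()
    best_c, best_cost = None, np.inf
    for c in np.unique(scores):
        miss = np.mean(scores[labels == 1] < c)    # P(call 0 | truly 1)
        fa = np.mean(scores[labels == 0] >= c)     # P(call 1 | truly 0)
        cost = cost_fn * prior_pos * miss + cost_fp * (1 - prior_pos) * fa
        if cost < best_cost:
            best_c, best_cost = c, cost
    return best_c

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.05, 2000)                    # 5% minority (e.g. default) class
s = np.clip(0.05 + 0.4 * y + rng.normal(0, 0.15, 2000), 0, 1)
print(optimal_cutoff(s, y, cost_fn=10.0, cost_fp=1.0))
```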

17.
In electrical tomography, multiple measurements of voltage are taken between electrodes on the boundary of a region with the aim of investigating the electrical conductivity distribution within the region. The relationship between conductivity and voltage is governed by an elliptic partial differential equation derived from Maxwell's equations. Recent statistical approaches, combining Bayesian methods with Markov chain Monte Carlo (MCMC) algorithms, allow greater flexibility than classical inverse solution approaches and require only the calculation of voltages from a conductivity distribution. However, solution of this forward problem still requires the use of the Finite Difference Method (FDM) or the Finite Element Method (FEM), and the many thousands of forward solutions needed strain practical feasibility. Many tomographic applications involve locating the perimeter of homogeneous-conductivity objects embedded in a homogeneous background. It is possible to exploit this type of structure using the Boundary Element Method (BEM) to provide a computationally efficient alternative forward solution technique. A geometric model is then used to define the region boundary, with priors on boundary smoothness and on the range of feasible conductivity values. This paper investigates the use of a BEM/MCMC approach for electrical resistance tomography (ERT) data. The efficiency of the boundary element method coupled with the flexibility of the MCMC technique gives a promising new approach to object identification in electrical tomography. Simulated ERT data are used to illustrate the procedures.

18.
Parameter Estimation in Large Dynamic Paired Comparison Experiments
Paired comparison data in which the abilities or merits of the objects being compared may be changing over time can be modelled as a non-linear state space model. When the population of objects being compared is large, likelihood-based analyses can be too computationally cumbersome to carry out regularly. This presents a problem for rating populations of chess players and other large groups which often consist of tens of thousands of competitors. This problem is overcome through a computationally simple non-iterative algorithm for fitting a particular dynamic paired comparison model. The algorithm, which improves on the commonly used algorithm of Elo by incorporating the variability in parameter estimates, can be performed regularly even for large populations of competitors. The method is evaluated on simulated data and is applied to ranking the best chess players of all time, and to ranking the top current tennis players.
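For context, the commonly used Elo update that the paper improves upon looks like this; the sketch below is standard Elo only, whereas the proposed algorithm additionally carries forward the uncertainty of each rating, which is not shown here.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One standard Elo update for a game between players A and B.
    score_a is 1 for an A win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

print(elo_update(2600, 2500, 1))   # favourite wins: small rating change
print(elo_update(2500, 2600, 1))   # underdog wins: larger rating change
```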

19.
Phase II clinical trials often use a binary outcome, so assessing the success rate of the treatment is a primary objective. Reporting confidence intervals is common practice for clinical trials. Due to the group sequential design and the relatively small sample size, many existing confidence intervals for phase II trials are overly conservative. In this paper, we propose a class of confidence intervals for binary outcomes. We also provide a general theory for assessing the coverage of confidence intervals for discrete distributions, and hence make recommendations for choosing the parameter used in calculating the confidence interval. The proposed method is applied to Simon's [14] optimal two-stage design with numerical studies. The proposed method can be viewed as an alternative approach to confidence intervals for discrete distributions in general.
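A minimal sketch of assessing exact coverage for a discrete outcome is given below, shown for the single-stage binomial with the Clopper-Pearson interval; the paper's setting is the two-stage Simon design, where the sample space and the interval construction differ.

```python
import numpy as np
from scipy import stats

def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided interval for a binomial proportion."""
    lo = stats.beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
    hi = stats.beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0
    return lo, hi

def exact_coverage(n, p, alpha=0.05):
    """Exact coverage at a fixed true p: sum binomial probabilities over all
    outcomes whose interval contains p (no simulation needed for discrete data)."""
    cover = 0.0
    for x in range(n + 1):
        lo, hi = clopper_pearson(x, n, alpha)
        if lo <= p <= hi:
            cover += stats.binom.pmf(x, n, p)
    return cover

for p in (0.1, 0.2, 0.3, 0.5):
    print(p, round(exact_coverage(25, p), 4))
```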

20.
Mixture cure models are widely used when a proportion of patients are cured. The proportional hazards mixture cure model and the accelerated failure time mixture cure model are the most popular models in practice. Usually the expectation-maximisation (EM) algorithm is applied to both models for parameter estimation. Bootstrap methods are used for variance estimation. In this paper we propose a smooth semi-nonparametric (SNP) approach in which maximum likelihood is applied directly to mixture cure models for parameter estimation. The variance can be estimated by the inverse of the second derivative of the SNP likelihood. A comprehensive simulation study indicates good performance of the proposed method. We investigate stage effects in breast cancer by applying the proposed method to breast cancer data from the South Carolina Cancer Registry.
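The mixture cure structure referred to above can be written, in its standard form (the notation here is generic, not necessarily the paper's), as:

```latex
% pi(z): probability of being cured given covariates z;
% S_u(t | x): survival function of the uncured (latency) part.
S_{\mathrm{pop}}(t \mid x, z) \;=\; \pi(z) + \{1-\pi(z)\}\, S_u(t \mid x).

% In the usual EM fit, the E-step weight (probability of being uncured)
% for a censored subject i is
w_i \;=\; \frac{\{1-\pi(z_i)\}\, S_u(t_i \mid x_i)}
               {\pi(z_i) + \{1-\pi(z_i)\}\, S_u(t_i \mid x_i)},
\qquad w_i = 1 \ \text{if subject } i \text{ experienced the event.}
```

The SNP approach described in the abstract instead maximises the likelihood of this model directly, rather than iterating EM steps and bootstrapping for variances.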
