Similar Literature
20 similar documents found.
1.
The transmission/disequilibrium test (TDT) is widely used to detect the linkage disequilibrium between a candidate locus (a marker) and a disease locus. The TDT is a family-based design and has the advantage that it is a valid test when population stratification exists. The TDT requires the marker genotypes of affected individuals and their parents. For diseases with late age of onset, it is difficult or impossible to obtain the marker genotypes of the parents. Therefore, when both parents' marker genotypes are unavailable, Ewens and Spielman extended the TDT to the S-TDT for use in sibships with at least one affected individual and one unaffected individual. When only one parent's genotype is available, Sun et al. proposed a test, the 1-TDT, for use with marker genotypes of affected individuals and only one available parent. Here, we study the sample sizes of the TDT, S-TDT, and 1-TDT. We show that the sample size needed for the 1-TDT is roughly the same as the sample size needed for the S-TDT with two sibs and is about twice the sample size for the TDT.
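As a concrete illustration (not taken from the paper), the classical TDT for a biallelic marker is a McNemar-type test on the counts of transmitted versus untransmitted alleles from heterozygous parents. A minimal Python sketch with hypothetical counts:

```python
from scipy import stats

def tdt_statistic(b: int, c: int) -> tuple[float, float]:
    """McNemar-type TDT: b = number of times allele A1 is transmitted by
    heterozygous parents, c = number of times A1 is not transmitted.
    Returns (chi-square statistic, p-value on 1 df)."""
    chi2 = (b - c) ** 2 / (b + c)
    p = stats.chi2.sf(chi2, df=1)
    return chi2, p

# Hypothetical counts: 60 transmissions vs. 40 non-transmissions
print(tdt_statistic(60, 40))  # chi2 = 4.0, p ~ 0.0455
```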

2.
Many late-onset complex diseases exhibit variable age of onset. Efficiently incorporating age of onset information into linkage analysis can potentially increase the power of dissecting complex diseases. In this paper, we treat age of onset as a genetic trait with censored observations. We use multiple markers to infer the inheritance vector at the disease susceptibility (DS) locus in order to extract information about the inheritance pattern of the disease allele in a pedigree. Given the inheritance distribution at the DS locus, we define the genetic frailty for each individual within a nuclear family as the sum of frailties due to a putative major disease gene and a polygenic effect due to any remaining DS loci. Conditioning on these frailties, we use the proportional hazards model for the risk of developing disease. We show that a test of linkage can be formulated as a test of zero variance due to a specific locus of the additive gamma frailties. Maximum likelihood estimation, using the EM algorithm, and likelihood ratio tests are employed for parameter estimation and tests of linkage. A simulation study indicates that the proposed method is well behaved and can be more powerful than the currently available allele-sharing based linkage methods. A breast cancer data example is used for illustration.
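A minimal sketch of the data-generating side of such a model, simulating censored ages of onset under an additive gamma frailty with an exponential baseline hazard; all rates and variance components are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_onset(n, var_major=0.5, var_poly=0.3, base_rate=0.02, censor_age=80.0):
    """Simulate ages of onset under an additive gamma-frailty PH model.

    Z = Z_major + Z_poly share a common scale v = var_major + var_poly, so
    E[Z] = 1 and Var[Z] = v; the conditional hazard is Z * base_rate
    (exponential baseline). Returns observed times and event indicators."""
    v = var_major + var_poly
    z = (rng.gamma(shape=var_major / v**2, scale=v, size=n)
         + rng.gamma(shape=var_poly / v**2, scale=v, size=n))
    onset = rng.exponential(scale=1.0 / (z * base_rate), size=n)
    observed = np.minimum(onset, censor_age)
    event = onset <= censor_age
    return observed, event

times, events = simulate_onset(1000)
print(f"{events.mean():.0%} affected before age 80")
```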

3.
The isolation of DNA markers that are linked to interesting genes helps plant breeders to select parent plants that transmit useful traits to future generations. Such 'marker-assisted breeding and selection' heavily leans on statistical testing of associations between markers and a well-chosen trait. Statistical association analysis is guided by classical p-values or the false discovery rate and thus relies predominantly on the null hypothesis. The main concern of plant breeders, however, is to avoid missing an important alternative. To judge evidence from this perspective, we complement the traditional p-value with a one-sided 'alternative p-value' which summarizes evidence against a target alternative in the direction of the null hypothesis. This p-value measures 'impotence' as opposed to significance: how likely is it to observe an outcome as extreme as or more extreme than the one that was observed when data stem from the alternative? We show how a graphical inspection of both p-values can guide marker selection when the null and the alternative hypotheses have a comparable importance. We derive formal decision tools with balanced properties yielding different rejection regions for different markers. We apply our approach to study rye-grass plants.
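For a one-sided z-test, both quantities have closed forms. The sketch below assumes a target alternative delta standard errors from the null; delta and the observed z are hypothetical:

```python
from scipy.stats import norm

def both_p_values(z_obs: float, delta: float):
    """For a one-sided z-test of H0: mu = 0 against a target alternative
    mu = delta (z_obs on the standard-normal scale under H0).

    p_null: P(Z >= z_obs | H0)          -- classical significance
    p_alt:  P(Z <= z_obs | mu = delta)  -- 'impotence' against the target"""
    p_null = norm.sf(z_obs)
    p_alt = norm.cdf(z_obs - delta)
    return p_null, p_alt

# Hypothetical: observed z = 1.5, target alternative 3 SEs from the null
print(both_p_values(1.5, 3.0))  # p_null ~ 0.067, p_alt ~ 0.067
```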

4.
The authors consider affected-sib-pair analysis, in which genetic marker data are collected from families with at least two sibs affected by a disease under investigation. At any locus not linked to the disease gene, a sib pair shares 0, 1 and 2 alleles identical by descent (IBD) with probabilities of 1/4, 1/2 and 1/4, respectively. With linkage, the IBD value increases stochastically. Louis, Payami & Thomson (1987) and Holmans (1993) were the first to discover that the IBD distribution satisfies the "possible triangle constraint" in some situations. Consequently, more powerful statistical procedures can be designed for detecting linkage. It is of statistical and genetic importance to investigate whether the possible triangle constraint remains true under general genetic models. In this paper, the authors introduce a new technique to prove the possible triangle constraint. Their proof is particularly simple for the single disease locus case. The general case is proved by linking IBD distributions between marker loci through a transition probability matrix.
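For affected sib pairs, the possible triangle is the region z1 <= 1/2 and 2*z0 <= z1. A small checker, with illustrative values only:

```python
def in_possible_triangle(z0: float, z1: float, z2: float, tol: float = 1e-9) -> bool:
    """Check Holmans' possible triangle constraint for affected-sib-pair
    IBD probabilities (z0, z1, z2): z1 <= 1/2 and 2*z0 <= z1."""
    assert abs(z0 + z1 + z2 - 1.0) < 1e-6, "probabilities must sum to 1"
    return z1 <= 0.5 + tol and 2 * z0 <= z1 + tol

print(in_possible_triangle(0.25, 0.50, 0.25))  # null model: True (on the boundary)
print(in_possible_triangle(0.15, 0.45, 0.40))  # linkage-like: True
print(in_possible_triangle(0.40, 0.30, 0.30))  # violates 2*z0 <= z1: False
```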

5.
A marker's capacity to predict risk of a disease depends on disease prevalence in the target population and its classification accuracy, i.e. its ability to discriminate diseased subjects from non-diseased subjects. The latter is often considered an intrinsic property of the marker; it is independent of disease prevalence and hence more likely to be similar across populations than risk prediction measures. In this paper, we are interested in evaluating the population-specific performance of a risk prediction marker in terms of positive predictive value (PPV) and negative predictive value (NPV) at given thresholds, when samples are available from the target population as well as from another population. A default strategy is to estimate PPV and NPV using samples from the target population only. However, when the marker's classification accuracy as characterized by a specific point on the receiver operating characteristic (ROC) curve is similar across populations, borrowing information across populations allows increased efficiency in estimating PPV and NPV. We develop estimators that optimally combine information across populations. We apply this methodology to a cross-sectional study where we evaluate PCA3 as a risk prediction marker for prostate cancer among subjects with or without previous negative biopsy.
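The dependence of PPV and NPV on prevalence, in contrast to the prevalence-free ROC point, follows directly from Bayes' rule; a minimal sketch with hypothetical accuracy values:

```python
def ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    """Bayes' rule: positive and negative predictive values from a
    marker's sensitivity/specificity and the population prevalence."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Same classification accuracy, two hypothetical prevalences
for prev in (0.05, 0.25):
    ppv, npv = ppv_npv(sens=0.80, spec=0.90, prev=prev)
    print(f"prev={prev:.2f}: PPV={ppv:.3f}, NPV={npv:.3f}")
```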

6.
An algorithm for functional evaluation of the likelihood of paternal and maternal recombination fractions for pedigree data is proposed. The idea behind the algorithm is that the probability of affected status and certain marker genotypes of ancestors is inherited by their descendants along with the inheritance of certain haplotypes. In this algorithm, the likelihood is evaluated by a single recursive call for each terminal sibling set along the inheritance flow. The advantage of the algorithm is not only the simplicity of its implementation, but also its functional form of evaluation: the likelihood is obtained as a polynomial in the recombination fractions, making it possible to validate the likelihood more carefully and thereby localize the disease locus more accurately. We report an experimental implementation of this algorithm in R, together with several practical applications.
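As a toy illustration of a likelihood that is a polynomial in the recombination fraction (a phase-known backcross with R recombinants among N informative meioses, far simpler than the paper's pedigree recursion), using sympy rather than the authors' R implementation:

```python
import sympy as sp

theta = sp.symbols('theta')

def likelihood_polynomial(n_recomb: int, n_meioses: int) -> sp.Expr:
    """Phase-known backcross: each informative meiosis is recombinant
    with probability theta, so L(theta) = theta^R * (1 - theta)^(N - R),
    a polynomial in the recombination fraction."""
    return sp.expand(theta**n_recomb * (1 - theta)**(n_meioses - n_recomb))

L = likelihood_polynomial(2, 10)
print(L)                                  # degree-10 polynomial in theta
print(sp.nsolve(sp.diff(L, theta), 0.2))  # MLE: theta-hat = R/N = 0.2
```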

7.
The availability of next generation sequencing (NGS) technology in today's biomedical research has provided new opportunities in the scientific discovery of genetic information. The high-throughput NGS technology, especially DNA-seq, is particularly useful in profiling a genome for the analysis of DNA copy number variants (CNVs). The read count (RC) data resulting from NGS technology are massive and information rich. How to exploit the RC data for accurate CNV detection has become a computational and statistical challenge. In this paper, we provide a statistical online change-point method to help detect CNVs in sequencing RC data. This method uses the idea of online searching for a change point (or breakpoint), with a Markov chain assumption on the breakpoint loci and an iterative computing process via a Bayesian framework. We illustrate that an online change-point detection method is particularly suitable for identifying CNVs in RC data. The algorithm is applied to the publicly available NCI-H2347 lung cancer cell line sequencing reads data for locating the breakpoints. Extensive simulation studies have been carried out, and the results show the good performance of the proposed algorithm. The algorithm is implemented in R and the code is available upon request.
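The paper's exact algorithm is not reproduced here; as a generic stand-in, the sketch below runs a Bayesian online change-point recursion in the style of Adams and MacKay on Poisson read counts with a conjugate Gamma prior, whose posterior predictive is negative binomial. The hazard and prior values are illustrative assumptions:

```python
import numpy as np
from scipy.stats import nbinom

def bocpd_poisson(counts, hazard=1/200, a0=1.0, b0=1.0):
    """Online run-length posterior for Poisson data with a Gamma(a0, b0)
    rate prior (b = rate parameter). Returns, per time step, the most
    probable run length; short runs appear right after a breakpoint."""
    log_r = np.array([0.0])              # run-length posterior, log scale
    a, b = np.array([a0]), np.array([b0])
    map_run = np.empty(len(counts), dtype=int)
    for t, x in enumerate(counts):
        # Predictive prob of x under each current run's Gamma posterior
        pred = nbinom.pmf(x, a, b / (b + 1.0))
        growth = log_r + np.log(pred + 1e-300) + np.log(1 - hazard)
        change = np.logaddexp.reduce(log_r + np.log(pred + 1e-300)) + np.log(hazard)
        log_r = np.concatenate([[change], growth])
        log_r -= np.logaddexp.reduce(log_r)   # normalize
        a = np.concatenate([[a0], a + x])     # conjugate posterior updates
        b = np.concatenate([[b0], b + 1.0])
        map_run[t] = int(np.argmax(log_r))
    return map_run

# Simulated read counts with a copy-number jump at position 100
rng = np.random.default_rng(0)
y = np.concatenate([rng.poisson(10, 100), rng.poisson(25, 100)])
runs = bocpd_poisson(y)
print(np.where(np.diff(runs) < 0)[0][:3])  # run length resets near t = 100
```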

8.
Sun Yifan et al. 《统计研究》 (Statistical Research), 2019, 36(3): 124-128
Identifying disease-causing genes from among a large number of genes is a very important high-dimensional statistical problem in the big-data setting. Because of the network structure that exists among genes, the identification of disease genes has expanded from identifying single genes to identifying gene modules. Mining gene modules from a gene network is the so-called community detection (or node clustering) problem. The vast majority of community detection methods use only the network structure information and ignore the information carried by the nodes themselves. In 2016, Newman and Clauset proposed a statistically grounded community detection method (abbreviated the NC method) that organically combines the two. Taking the NC method as a case study, this paper introduces the application of statistical methods to real gene networks and the results achieved, and proposes improvements from a statistical perspective. The analysis of the NC method shows that, for unstructured data as represented by gene networks, statistical ideas and principles remain central to data analysis, while the corresponding statistical methods need to be adjusted and optimized according to the characteristics of the data and the questions of interest.

9.
Recently developed genotype imputation methods are a powerful tool for detecting untyped genetic variants that affect disease susceptibility in genetic association studies. However, existing imputation methods require individual-level genotype data, whereas in practice it is often the case that only summary data are available. For example, this may occur because, for reasons of privacy or politics, only summary data are made available to the research community at large; or because only summary data are collected, as in DNA pooling experiments. In this article, we introduce a new statistical method that can accurately infer the frequencies of untyped genetic variants in these settings, and indeed substantially improve frequency estimates at typed variants in pooling experiments where observations are noisy. Our approach, which predicts each allele frequency using a linear combination of observed frequencies, is statistically straightforward and related to a long history of the use of linear methods for estimating missing values (e.g. Kriging). The main statistical novelty is our approach to regularizing the covariance matrix estimates, and the resulting linear predictors, which is based on methods from population genetics. We find that, besides being both fast and flexible (allowing new problems to be tackled that cannot be handled by existing imputation approaches purpose-built for the genetic context), these linear methods are also very accurate. Indeed, imputation accuracy using this approach is similar to that obtained by state-of-the-art imputation methods that use individual-level data, but at a fraction of the computational cost.
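A minimal sketch of the kriging-type predictor, conditioning the untyped frequency on typed neighbors through a covariance matrix; the ridge-style shrinkage used here is a generic stand-in for the authors' population-genetics regularization, and all numbers are toy values:

```python
import numpy as np

def impute_frequency(f_obs, mu, Sigma, obs_idx, miss_idx, shrink=0.1):
    """Best linear predictor of missing allele frequencies:
    f_miss = mu_m + S_mo S_oo^{-1} (f_obs - mu_o),
    with simple shrinkage of the covariance toward its diagonal
    (a generic stand-in for the authors' regularizer)."""
    S = (1 - shrink) * Sigma + shrink * np.diag(np.diag(Sigma))
    S_oo = S[np.ix_(obs_idx, obs_idx)]
    S_mo = S[np.ix_(miss_idx, obs_idx)]
    w = np.linalg.solve(S_oo, f_obs - mu[obs_idx])
    return mu[miss_idx] + S_mo @ w

# Toy example: 3 typed markers, 1 untyped, AR(1)-like LD covariance
mu = np.full(4, 0.3)
Sigma = np.array([[1, .8, .6, .4], [.8, 1, .8, .6],
                  [.6, .8, 1, .8], [.4, .6, .8, 1]]) * 0.01
f_typed = np.array([0.35, 0.32, 0.28])
print(impute_frequency(f_typed, mu, Sigma, [0, 1, 2], [3]))
```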

10.
An advantage of randomization tests for small samples is that an exact P-value can be computed under an additive model. A disadvantage with very small sample sizes is that the resulting discrete distribution for P-values can make it mathematically impossible for a P-value to attain a particular degree of significance. We investigate a distribution of P-values that arises when several thousand randomization tests are conducted simultaneously using small samples, a situation that arises with microarray gene expression data. We show that the distribution yields valuable information regarding groups of genes that are differentially expressed between two groups: a treatment group and a control group. This distribution helps to categorize genes with varying degrees of overlap of genetic expression values between the two groups, and it helps to quantify the degree of overlap by using the P-value from a randomization test. Moreover, a statistical test is available that compares the actual distribution of P-values with an expected distribution if there are no genes that are differentially expressed. We demonstrate the method and illustrate the results by using a microarray data set involving a cell line for rheumatoid arthritis. A small simulation study evaluates the effect that correlated gene expression levels could have on results from the analysis.
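A minimal sketch of an exact two-sample randomization test, which also makes the discreteness problem visible: with four observations per group only 70 reassignments exist, so no two-sided P-value below 2/70 is attainable. The data are hypothetical:

```python
from itertools import combinations
import numpy as np

def exact_randomization_p(treat, control):
    """Exact two-sample randomization test (difference in means).
    Enumerates all reassignments of the pooled values to groups, so the
    P-value is exact but takes only C(n1+n2, n1) discrete values."""
    pooled = np.concatenate([treat, control])
    n1, total = len(treat), pooled.sum()
    obs = np.mean(treat) - np.mean(control)
    count = denom = 0
    for idx in combinations(range(len(pooled)), n1):
        s1 = pooled[list(idx)].sum()
        diff = s1 / n1 - (total - s1) / (len(pooled) - n1)
        count += abs(diff) >= abs(obs) - 1e-12
        denom += 1
    return count / denom

# With n1 = n2 = 4 only C(8, 4) = 70 assignments exist, so the smallest
# attainable two-sided P-value is 2/70, about 0.029, and never below that.
g1 = np.array([9.1, 8.7, 9.4, 9.0])
g2 = np.array([7.2, 7.9, 7.5, 8.1])
print(exact_randomization_p(g1, g2))
```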

11.
In this paper, we study the change-point inference problem motivated by genomic data that were collected for the purpose of monitoring DNA copy number changes. DNA copy number changes, or copy number variations (CNVs), correspond to chromosomal aberrations and signify abnormality of a cell. Cancer development and other related diseases are usually relevant to DNA copy number changes on the genome. Such data contain inherent random noise; there is therefore a need to employ an appropriate statistical model for identifying statistically significant DNA copy number changes. This type of statistical inference is evidently crucial in cancer research, clinical diagnostic applications, and other related genomic research. For the high-throughput genomic data resulting from DNA copy number experiments, a mean and variance change point model (MVCM) for detecting the CNVs is appropriate. We propose to use a Bayesian approach to study the MVCM for the case of a single change, and to use a sliding window to search for all CNVs on a given chromosome. We carry out simulation studies to evaluate the estimate of the locus of the DNA copy number change using the derived posterior probability. These simulation results show that the approach is suitable for identifying copy number changes. The approach is also illustrated on several chromosomes from nine fibroblast cancer cell line data sets (array-based comparative genomic hybridization data). All DNA copy number aberrations that had been identified and verified by karyotyping are detected by our approach on these cell lines.
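As a simplified stand-in for the Bayesian MVCM, the sketch below scans for a single mean-and-variance change by profile likelihood under a Gaussian model; the simulated data are illustrative:

```python
import numpy as np

def mvcm_scan(y, min_seg=5):
    """Scan statistic for a single mean-and-variance change point under a
    Gaussian likelihood: for each split k, plug in segment MLEs and keep
    the split maximizing the profile log-likelihood.
    (A likelihood-based stand-in for the paper's Bayesian posterior.)"""
    n = len(y)
    best_k, best_ll = None, -np.inf
    for k in range(min_seg, n - min_seg):
        v1, v2 = np.var(y[:k]), np.var(y[k:])
        ll = -0.5 * (k * np.log(v1) + (n - k) * np.log(v2))
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k

rng = np.random.default_rng(3)
# log2-ratio-like data: a baseline segment, then shifted mean and variance
y = np.concatenate([rng.normal(0.0, 0.15, 120), rng.normal(0.6, 0.30, 80)])
print(mvcm_scan(y))  # estimated breakpoint near 120
```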

12.
In an attempt to provide a statistical tool for disease screening and prediction, we propose a semiparametric approach to the analysis of the Cox proportional hazards cure model in situations where the observations on the event time are subject to right censoring and some covariates are missing not at random. To facilitate the methodological development, we begin with semiparametric maximum likelihood estimation (SPMLE), assuming that the (conditional) distribution of the missing covariates is known. A variant of the EM algorithm is used to compute the estimator. We then adapt the SPMLE to a more practical situation where the distribution is unknown and there is a consistent estimator based on available information. We establish the consistency and weak convergence of the resulting pseudo-SPMLE, and identify a suitable variance estimator. The application of our inference procedure to disease screening and prediction is illustrated via empirical studies. The proposed approach is used to analyze the tuberculosis screening study data that motivated this research. Its finite-sample performance is examined by simulation.
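A minimal sketch of the data structure a cure model describes (a logistic cure fraction plus a proportional hazards latency for the uncured), simulated with illustrative parameters; this shows the model, not the paper's SPMLE machinery:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_cure_model(n, gamma=(-1.0, 1.2), beta=0.8, base_rate=0.1, c_max=20.0):
    """Simulate right-censored data from a mixture cure model:
    P(uncured | x) = logistic(gamma0 + gamma1 * x); uncured subjects get an
    exponential PH event time with hazard base_rate * exp(beta * x);
    cured subjects are always censored."""
    x = rng.normal(size=n)
    uncured = rng.random(n) < 1 / (1 + np.exp(-(gamma[0] + gamma[1] * x)))
    t_event = rng.exponential(1 / (base_rate * np.exp(beta * x)))
    censor = rng.uniform(0, c_max, n)
    time = np.where(uncured, np.minimum(t_event, censor), censor)
    event = uncured & (t_event <= censor)
    return x, time, event

x, time, event = simulate_cure_model(500)
print(f"observed event rate: {event.mean():.2f}")
```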

13.
In this article, we use a latent class model (LCM) with prevalence modeled as a function of covariates to assess diagnostic test accuracy in situations where the true disease status is not observed, but observations on three or more conditionally independent diagnostic tests are available. A fast Monte Carlo expectation–maximization (MCEM) algorithm with binary (disease) diagnostic data is implemented to estimate parameters of interest; namely, sensitivity, specificity, and prevalence of the disease as a function of covariates. To obtain standard errors for confidence interval construction of estimated parameters, the missing information principle is applied to adjust information matrix estimates. We compare the adjusted information matrix-based standard error estimates with the bootstrap standard error estimates, both obtained using the fast MCEM algorithm, through an extensive Monte Carlo study. Simulation demonstrates that the adjusted information matrix approach estimates the standard error similarly to the bootstrap methods under certain scenarios. The bootstrap percentile intervals have satisfactory coverage probabilities. We then apply the LCM analysis to a real data set of 122 subjects from a Gynecologic Oncology Group study of significant cervical lesion diagnosis in women with atypical glandular cells of undetermined significance to compare the diagnostic accuracy of a histology-based evaluation, a carbonic anhydrase-IX biomarker-based test and a human papillomavirus DNA test.
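A minimal sketch of the simplest version of such a model: plain EM for a two-class LCM with three conditionally independent binary tests and no covariates (the paper's MCEM additionally models prevalence through covariates). All simulation values are hypothetical:

```python
import numpy as np

def lcm_em(Y, n_iter=500):
    """EM for a two-class latent class model with K conditionally
    independent binary tests. Y: (n, K) array of 0/1 results.
    Returns estimated prevalence, sensitivities, and specificities."""
    n, K = Y.shape
    prev, sens, spec = 0.5, np.full(K, 0.8), np.full(K, 0.8)
    for _ in range(n_iter):
        # E-step: posterior probability of disease for each subject
        l1 = prev * np.prod(sens**Y * (1 - sens)**(1 - Y), axis=1)
        l0 = (1 - prev) * np.prod((1 - spec)**Y * spec**(1 - Y), axis=1)
        w = l1 / (l1 + l0)
        # M-step: weighted proportions
        prev = w.mean()
        sens = (w[:, None] * Y).sum(0) / w.sum()
        spec = ((1 - w)[:, None] * (1 - Y)).sum(0) / (1 - w).sum()
    return prev, sens, spec

rng = np.random.default_rng(11)
d = rng.random(1000) < 0.3                       # true latent disease status
true_se, true_sp = np.array([.9, .8, .85]), np.array([.95, .9, .9])
Y = np.where(d[:, None], rng.random((1000, 3)) < true_se,
             rng.random((1000, 3)) >= true_sp).astype(int)
print(lcm_em(Y))  # estimates close to 0.3, true_se, true_sp
```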

14.
The paper considers the problem of phylogenetic tree construction. Our approach is based on a non-parametric paradigm, seeking a model-free construction and symmetry between Type I and Type II errors. Trees are constructed through sequential tests using Hamming distance dissimilarity measures, from the internal nodes to the tips. The method presents some novelties. The first, which is an advantage over traditional methods, is that it is very fast, computationally efficient and feasible for very large data sets. Two other novelties are its capacity to deal directly with multiple sequences per group (building its statistical properties upon this richer information) and that the best tree does not have a predetermined number of tips; that is, the resulting number of tips is statistically meaningful. We apply the method to two data sets of DNA sequences, illustrating that it can perform quite well even on very unbalanced designs. Computational complexities are also addressed. Supplemental materials are available online.
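The authors' sequential-testing construction is not reproduced here; as a rough distance-based analogue, the sketch below builds a tree by hierarchical clustering of pairwise Hamming dissimilarities on toy sequences:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy aligned sequences (rows), one integer code per site
seqs = np.array([
    [0, 1, 1, 0, 2, 1, 0, 3],
    [0, 1, 1, 0, 2, 1, 1, 3],
    [1, 1, 0, 0, 2, 0, 1, 3],
    [1, 1, 0, 1, 2, 0, 1, 0],
])
labels = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]

# Pairwise Hamming dissimilarities (fraction of differing sites)
d = pdist(seqs, metric="hamming")
tree = linkage(d, method="average")  # UPGMA-style agglomeration
print(tree)  # merge order and heights; plot with dendrogram(tree, labels=labels)
```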

15.
The field of genetic epidemiology is growing rapidly with the realization that many important diseases are influenced by both genetic and environmental factors. For this reason, pedigree data are becoming increasingly valuable as a means of studying patterns of disease occurrence. Analysis of pedigree data is complicated by the lack of independence among family members and by the non-random sampling schemes used to ascertain families. An additional complicating factor is the variability in age at disease onset from one person to another. In developing statistical methods for analysing pedigree data, analytic results are often intractable, making simulation studies imperative for assessing the performance of proposed methods and estimators. In this paper, an algorithm is presented for simulating disease data in pedigrees, incorporating variable age at onset and genetic and environmental effects. Computational formulas are developed in the context of a proportional hazards model and assuming single ascertainment of families, but the methods can be easily generalized to alternative models. The algorithm is computationally efficient, making multi-dataset simulation studies feasible. Numerical examples are provided to demonstrate the methods.
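A minimal sketch of the core simulation step for one nuclear family: Mendelian transmission at a biallelic locus followed by inverse-transform sampling of age at onset under a proportional hazards model with an exponential baseline. Parameters are illustrative, and ascertainment correction is omitted:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_family(beta=1.0, base_rate=0.005, p_allele=0.2, n_kids=3):
    """One nuclear family: Mendelian transmission at a biallelic locus,
    then age at onset by inverse transform under a PH model with an
    exponential baseline: T = -log(U) / (base_rate * exp(beta * g)),
    where g is the individual's risk-allele count."""
    parents = rng.binomial(2, p_allele, size=2)   # risk-allele counts
    kids = np.array([sum(rng.random() < parents[i] / 2 for i in (0, 1))
                     for _ in range(n_kids)])
    genos = np.concatenate([parents, kids])
    onset = -np.log(rng.random(genos.size)) / (base_rate * np.exp(beta * genos))
    return genos, onset

genos, onset = simulate_family()
print(genos, np.round(onset, 1))  # carriers tend to have earlier onset
```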

16.
The increasing availability of high-throughput data, that is, massive quantities of molecular biology data arising from different types of experiments such as gene expression or protein microarrays, leads to the necessity of methods for summarizing the available information. As annotation quality improves, it is becoming common to rely on biological annotation databases, such as the Gene Ontology (GO), to build functional profiles which characterize a set of genes or proteins using the distribution of their annotations in the database. In this work we describe a statistical model for such profiles, provide methods to compare profiles and develop inferential procedures to assess this comparison. An R package implementing the methods will be available at publication time.
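As a simple stand-in for the authors' model, the sketch below compares two hypothetical annotation-count profiles with a chi-square test of homogeneity (which, unlike the paper's model, treats the category counts as independent):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical annotation counts of two gene sets over four GO categories
go_terms = ["metabolism", "signaling", "transport", "cell cycle"]
profile_1 = np.array([40, 25, 20, 15])
profile_2 = np.array([22, 30, 28, 20])

# Chi-square test of homogeneity of the two functional profiles
chi2, p, dof, _ = chi2_contingency(np.vstack([profile_1, profile_2]))
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")

# Per-category comparison of the relative annotation frequencies
for term, f1, f2 in zip(go_terms, profile_1 / profile_1.sum(),
                        profile_2 / profile_2.sum()):
    print(f"{term:>10}: {f1:.2f} vs {f2:.2f}")
```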

17.
The efficient use of surrogate or auxiliary information has been investigated within both model-based and design-based approaches to data analysis, particularly in the context of missing data. Here we consider the use of such data in epidemiological studies of disease incidence in which surrogate measures of disease status are available for all subjects at two time points, but definitive diagnoses are available only in stratified subsamples. We briefly review methods for the analysis of two-phase studies of disease prevalence at a single time point, and we discuss the extension of four of these methods to the analysis of incidence studies. Their performance is compared with special reference to a study of the incidence of senile dementia.
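A minimal sketch of the stratified two-phase idea: strata defined by the surrogate at phase 1, definitive diagnoses on a subsample at phase 2, and inverse-probability (Horvitz-Thompson) weighting to estimate prevalence. The counts are hypothetical:

```python
import numpy as np

def two_phase_prevalence(n_stratum, n_verified, n_diseased):
    """Stratified two-phase estimator: everyone is screened with the
    surrogate (phase 1, defining strata); definitive diagnoses come from
    a subsample per stratum (phase 2). Each verified case is weighted by
    the inverse of its stratum's sampling fraction (Horvitz-Thompson)."""
    n_stratum = np.asarray(n_stratum, float)
    weights = n_stratum / np.asarray(n_verified, float)
    cases_hat = (weights * np.asarray(n_diseased, float)).sum()
    return cases_hat / n_stratum.sum()

# Hypothetical: surrogate-positive stratum heavily oversampled at phase 2
#   stratum sizes [300, 1700], verified [150, 170], diseased among verified
print(two_phase_prevalence([300, 1700], [150, 170], [90, 17]))  # 0.175
```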

18.
In biomedical research, two or more biomarkers may be available for the diagnosis of a particular disease. Selecting the single biomarker that best discriminates a diseased group from a healthy group is a challenge in the diagnostic process. Most often, an accuracy measure, the area under the receiver operating characteristic (ROC) curve, is used to choose the best diagnostic marker among the available markers. Some authors have tried to combine multiple markers by an optimal linear combination to increase the discriminatory power. In this paper, we propose an alternative method that combines two continuous biomarkers by direct bivariate modeling of the ROC curve under a log-normality assumption. The proposed method is applied to a simulated data set and a prostate cancer diagnostic biomarker data set.
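For contrast with the paper's direct bivariate ROC modeling, the sketch below computes the familiar optimal linear combination under binormality (Su-Liu type weights) and its induced binormal AUC on simulated log-scale markers:

```python
import numpy as np
from scipy.stats import norm

def best_linear_combo(Xd, Xh):
    """Optimal linear combination of two (log-transformed) markers under
    binormality: w proportional to (S_d + S_h)^{-1} (mu_d - mu_h).
    Returns the weights and the binormal AUC of the combined score."""
    mu_d, mu_h = Xd.mean(0), Xh.mean(0)
    S = np.cov(Xd.T) + np.cov(Xh.T)
    w = np.linalg.solve(S, mu_d - mu_h)
    delta = w @ (mu_d - mu_h)
    var = w @ np.cov(Xd.T) @ w + w @ np.cov(Xh.T) @ w
    return w, norm.cdf(delta / np.sqrt(var))

rng = np.random.default_rng(21)
diseased = rng.multivariate_normal([1.2, 0.8], [[1, .3], [.3, 1]], 200)
healthy = rng.multivariate_normal([0.0, 0.0], [[1, .3], [.3, 1]], 200)
w, auc = best_linear_combo(diseased, healthy)
print(w, f"AUC ~ {auc:.3f}")
```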

19.
We develop Bayesian inference methods for a recently-emerging type of epigenetic data to study the transmission fidelity of DNA methylation patterns over cell divisions. The data consist of parent-daughter double-stranded DNA methylation patterns with each pattern coming from a single cell and represented as an unordered pair of binary strings. The data are technically difficult and time-consuming to collect, putting a premium on an efficient inference method. Our aim is to estimate rates for the maintenance and de novo methylation events that gave rise to the observed patterns, while accounting for measurement error. We model data at multiple sites jointly, thus using whole-strand information, and considerably reduce confounding between parameters. We also adopt a hierarchical structure that allows for variation in rates across sites without an explosion in the effective number of parameters. Our context-specific priors capture the expected stationarity, or near-stationarity, of the stochastic process that generated the data analyzed here. This expected stationarity is shown to greatly increase the precision of the estimation. Applying our model to a data set collected at the human FMR1 locus, we find that measurement errors, generally ignored in similar studies, occur at a non-trivial rate (inappropriate bisulfite conversion error: 1.6% with 80% CI: 0.9-2.3%). Accounting for these errors has a substantial impact on estimates of key biological parameters. The estimated average failure of maintenance rate and daughter de novo rate decline from 0.04 to 0.024 and from 0.14 to 0.07, respectively, when errors are accounted for. Our results also provide evidence that de novo events may occur on both parent and daughter strands: the median parent and daughter de novo rates are 0.08 (80% CI: 0.04-0.13) and 0.07 (80% CI: 0.04-0.11), respectively.

20.
The paper presents work that creates a geographical information system database of European census data from 1870 to 2000. The database is integrated over space and time. Spatially it consists of regional level data for most of Europe; temporally it covers every decade from 1870 to 2000. Crucially, the data have been interpolated onto the administrative units that were available in 2000, thus allowing contemporary population patterns to be understood in the light of the changes that have occurred since the late 19th century. The effect of interpolation error on the resulting estimates is explored. This database will provide a framework for much future analysis of long-term Europe-wide demographic processes over space and time.
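Interpolating historical counts onto the 2000 administrative units is typically done by areal weighting; a minimal sketch under the uniform-density assumption (the zones and overlaps below are hypothetical, and this is a generic method, not necessarily the project's exact procedure):

```python
def areal_interpolate(source_pop, overlap_area, source_area):
    """Area-weighted interpolation of counts from historical source zones
    onto target (year-2000) units: each target unit receives each source
    zone's population in proportion to the share of the source zone's
    area it overlaps. Assumes uniform density within zones, which is the
    main source of interpolation error."""
    target_pop = {}
    for (src, tgt), area in overlap_area.items():
        target_pop[tgt] = target_pop.get(tgt, 0.0) + \
            source_pop[src] * area / source_area[src]
    return target_pop

# Hypothetical 1900 zones A, B re-apportioned onto 2000 units u1, u2
pop_1900 = {"A": 10_000, "B": 4_000}
areas = {"A": 50.0, "B": 20.0}
overlaps = {("A", "u1"): 30.0, ("A", "u2"): 20.0, ("B", "u2"): 20.0}
print(areal_interpolate(pop_1900, overlaps, areas))
# {'u1': 6000.0, 'u2': 8000.0}
```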

