Similar Articles
20 similar articles found.
1.
In this 'Big Data' era, statisticians inevitably encounter data generated from various disciplines. In particular, advances in biotechnology have enabled scientists to produce enormous datasets in various biological experiments. In the last two decades, we have seen high-throughput microarray data resulting from various genomic studies. Recently, next generation sequencing (NGS) technology has been playing an important role in the study of genomic features, resulting in vast amounts of NGS data. One frequent application of NGS technology is the study of DNA copy number variants (CNVs). Researchers then use the resulting NGS read count data to formulate scientific approaches to accurately detect CNVs. Computational and statistical approaches to the detection of CNVs using NGS data are, however, very limited at present. In this review paper, we focus on read-depth analysis in CNV detection and give a brief summary of the statistical analysis methods currently used to search for CNVs with NGS data. Based on the review, we also discuss the challenges we face and future research directions. The ultimate goal of this review paper is to give a timely exposition of the surveyed statistical methods to researchers in related fields.

2.
In this paper, we study the change-point inference problem motivated by genomic data collected for the purpose of monitoring DNA copy number changes. DNA copy number changes, or copy number variations (CNVs), correspond to chromosomal aberrations and signify abnormality of a cell. Cancer development and other related diseases are usually associated with DNA copy number changes on the genome. There is inherent random noise in such data; therefore, an appropriate statistical model is needed to identify statistically significant DNA copy number changes. This type of statistical inference is evidently crucial in cancer research, clinical diagnostic applications, and other related genomic research. For the high-throughput genomic data resulting from DNA copy number experiments, a mean and variance change point model (MVCM) for detecting CNVs is appropriate. We propose a Bayesian approach to study the MVCM for the case of a single change and use a sliding window to search for all CNVs on a given chromosome. We carry out simulation studies to evaluate the estimate of the locus of the DNA copy number change using the derived posterior probability. These simulation results show that the approach is suitable for identifying copy number changes. The approach is also illustrated on several chromosomes from data on nine fibroblast cancer cell lines (array-based comparative genomic hybridization data). All DNA copy number aberrations that had been identified and verified by karyotyping are detected by our approach on these cell lines.
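For illustration, a mean and variance change point model can be sketched with a simple profile-likelihood search over candidate split points. The snippet below is not the paper's Bayesian posterior computation; the function names and simulated data are purely illustrative.

```python
import numpy as np

def mvcm_change_point(y, min_seg=3):
    """Locate a single mean-and-variance change point in y by maximising
    the profile Gaussian log-likelihood over candidate split points."""
    y = np.asarray(y, dtype=float)
    n = len(y)

    def seg_loglik(seg):
        var = max(seg.var(), 1e-12)          # MLE variance, guarded against zero
        return -0.5 * len(seg) * (np.log(2 * np.pi * var) + 1.0)

    best_k, best_ll = None, -np.inf
    for k in range(min_seg, n - min_seg + 1):
        ll = seg_loglik(y[:k]) + seg_loglik(y[k:])
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k, best_ll

# Example: both mean and variance shift after index 100
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 3, 80)])
print(mvcm_change_point(y)[0])               # expected to be close to 100
```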

3.
The six recommendations made by the Guidelines for Assessment and Instruction in Statistics Education (GAISE) committee were first communicated in 2005 and more formally in 2010. In this article, 25 introductory statistics textbooks are examined to assess how well these textbooks have incorporated the three GAISE recommendations most relevant to implementation in textbooks (statistical literacy and thinking; use of real data; stressing concepts over procedures). The implementation of another recommendation (using technology) is described but not assessed. In general, most textbooks appear to be adopting the GAISE recommendations reasonably well in both exposition and exercises. The textbooks are particularly adept at using real data, using it well, and promoting statistical literacy. Textbooks are less adept, though still rated reasonably well in general, at stressing concepts over procedures and promoting statistical thinking. In contrast, few textbooks have easy-to-use glossaries of statistical terms to assist with understanding of statistical language and literacy development. Supplementary materials for this article are available online.

4.
We consider online monitoring of sequentially arriving data, as met, for example, in clinical information systems. The general focus is to detect breakpoints, i.e. time points where the measurement series suddenly changes its general level. The suggested method is based on local estimation; in particular, local linear smoothing is combined with local constant smoothing by ridging. The procedure is demonstrated by examples and compared with other available online monitoring routines.
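A rough sketch of the idea, assuming a Gaussian kernel and an illustrative ridge penalty that shrinks the local linear fit toward the local constant (Nadaraya-Watson) one; this is not the authors' exact estimator, and the function names, tuning constants, and flagging rule are placeholders.

```python
import numpy as np

def ridged_local_level(t, y, t0, bandwidth=5.0, lam=1.0):
    """Kernel-weighted local linear fit of the level at t0, with a ridge
    penalty on the slope so the fit shrinks toward the local constant
    estimate as lam grows."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    u = (t - t0) / bandwidth
    w = np.exp(-0.5 * u ** 2)                      # Gaussian kernel weights
    X = np.column_stack([np.ones(len(t)), t - t0])
    D = np.diag([0.0, 1.0])                        # penalise the slope only
    beta = np.linalg.solve(X.T @ (w[:, None] * X) + lam * D, X.T @ (w * y))
    return beta[0]                                 # estimated level at t0

def flag_breakpoints(t, y, threshold=3.0, **kwargs):
    """Flag time points whose observation deviates strongly from the level
    predicted from past data only (a rough online monitoring rule)."""
    flags = []
    for i in range(10, len(t)):
        level = ridged_local_level(t[:i], y[:i], t[i], **kwargs)
        scale = np.std(y[max(0, i - 20):i]) + 1e-8
        if abs(y[i] - level) > threshold * scale:
            flags.append(i)
    return flags
```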

5.
We consider the detection of changes in the mean of a set of time series. The breakpoints are allowed to be series specific, and the series are assumed to be correlated. The correlation between the series is supposed to be constant over time but is allowed to take an arbitrary form. We show that such a dependence structure can be encoded in a factor model. Thanks to this representation, the inference of the breakpoints can be achieved via dynamic programming, which remains one of the most efficient algorithms. We propose a model selection procedure to determine both the number of breakpoints and the number of factors. The proposed method is implemented in the FASeg R package, which is available on CRAN. We demonstrate the performance of our procedure through simulation experiments and present an application to geodesic data.
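The dynamic programming step can be illustrated on the simpler single-series case. The sketch below segments the mean of one series with a fixed number of breakpoints; it does not reproduce the paper's factor-model inference or its model selection procedure, and all names and simulated data are illustrative.

```python
import numpy as np

def dp_segmentation(y, K):
    """Optimal placement of K breakpoints (K + 1 mean segments) by dynamic
    programming, with within-segment sum of squares as the segment cost."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):                      # SSE of segment y[i:j] (j exclusive)
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    # C[k, j]: best cost of covering y[:j] with k segments; back[k, j]: last cut
    C = np.full((K + 2, n + 1), np.inf)
    back = np.zeros((K + 2, n + 1), dtype=int)
    for j in range(1, n + 1):
        C[1, j] = cost(0, j)
    for k in range(2, K + 2):
        for j in range(k, n + 1):
            cands = [C[k - 1, i] + cost(i, j) for i in range(k - 1, j)]
            best = int(np.argmin(cands))
            C[k, j] = cands[best]
            back[k, j] = best + (k - 1)
    breakpoints, j = [], n
    for k in range(K + 1, 1, -1):        # backtrack from K + 1 segments
        j = back[k, j]
        breakpoints.append(j)
    return sorted(breakpoints)

# Example: two breakpoints in the mean, near indices 50 and 120
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 70), rng.normal(-1, 1, 60)])
print(dp_segmentation(y, K=2))
```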

6.
王星  马璇 《统计研究》2015,32(10):74-81
This paper studies a change point estimation model for airline ticket price series shaped by the industry's dynamic pricing mechanisms. We analyse the structural features of ticket price series data and propose a multi-stage change point estimation framework suited to high-noise, step-shaped series with multiple pronounced change points. The framework cascades several mature data analysis methods, including the DBSCAN algorithm, EM-based Gaussian mixture model clustering, agglomerative hierarchical clustering, and change point estimation based on the product partition model. An empirical analysis of flights on the Beijing-Kunming route demonstrates the effectiveness and general applicability of the framework.

7.
New data collection and storage technologies have given rise to a new field of streaming data analytics: real-time statistical methodology for online data analysis. Most existing online learning methods are based on homogeneity assumptions, which require the samples in a sequence to be independent and identically distributed. However, inter-batch correlation and dynamically evolving batch-specific effects are among the key defining features of real-world streaming data such as electronic health records and mobile health data. This article is built on a state-space mixed model framework in which the observed data stream is driven by a latent state process that follows a Markov process. In this setting, online maximum likelihood estimation is made challenging by high-dimensional integrals and complex covariance structures. We develop a real-time, Kalman-filter-based regression analysis method that updates both point estimates and their standard errors for fixed population-average effects while adjusting for dynamic hidden effects. Both theoretical justification and numerical experiments demonstrate that our proposed online method has statistical properties similar to those of its offline counterpart and enjoys great computational efficiency. We also apply this method to analyze an electronic health record dataset.
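As a minimal sketch of the kind of recursion involved, the code below runs Kalman predict/update steps for a scalar latent state with known fixed effects. The actual method also updates the fixed-effect estimates and their standard errors online, which is omitted here; all parameter values and names are illustrative.

```python
import numpy as np

def kalman_step(m, P, y_t, x_t, beta, phi=0.9, q=0.1, r=1.0):
    """One predict/update step for the scalar-state model
        alpha_t = phi * alpha_{t-1} + eta_t,     eta_t ~ N(0, q)
        y_t     = x_t' beta + alpha_t + eps_t,   eps_t ~ N(0, r)
    where (m, P) is the filtered mean and variance of the previous state."""
    m_pred, P_pred = phi * m, phi ** 2 * P + q            # predict
    innov = y_t - (x_t @ beta + m_pred)                   # innovation
    S = P_pred + r                                        # innovation variance
    K = P_pred / S                                        # Kalman gain
    return m_pred + K * innov, (1.0 - K) * P_pred

# Stream data through the filter, adjusting for the latent state
rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5])
m, P = 0.0, 1.0
for _ in range(5):
    x_t = rng.standard_normal(2)
    y_t = x_t @ beta + rng.normal(scale=1.0)
    m, P = kalman_step(m, P, y_t, x_t, beta)
    print(round(m, 3), round(P, 3))
```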

8.
We introduce the transport–transform and the relative transport–transform metrics between finite point patterns on a general space, which provide a unified framework for earlier point pattern metrics, in particular the generalized spike time and the normalized and unnormalized optimal subpattern assignment metrics. Our main focus is on barycenters, i.e., minimizers of a q-th-order Fréchet functional with respect to these metrics. We present a heuristic algorithm that terminates in a local minimum and is shown to be fast and reliable in a simulation study. The algorithm serves as a general plug-in method that can be applied to point patterns on any state space where an appropriate algorithm for solving the location problem for individual points is available. We present applications to geocoded data of crimes in Euclidean space and on a street network, illustrating that barycenters serve as informative summary statistics. Our work is a first step toward statistical inference in covariate-based models of repeated point pattern observations.

9.
Mixture model-based clustering is widely used in many applications. In certain real-time applications, the rapid increase of data size with time makes classical clustering algorithms too slow. An online clustering algorithm based on mixture models is presented in the context of a real-time flaw-diagnosis application for pressurized containers, using data from acoustic emission signals. The proposed algorithm is a stochastic gradient algorithm derived from the classification version of the EM algorithm (CEM). It provides a model-based generalization of the well-known online k-means algorithm, able to handle non-spherical clusters. Using synthetic and real data sets, the proposed algorithm is compared with the batch CEM algorithm and the online EM algorithm. The three approaches generate comparable solutions in terms of the resulting partition when clusters are relatively well separated, but the online algorithms become faster as the number of available observations increases.
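The online k-means update that the proposed algorithm generalizes can be sketched in a few lines; the snippet below shows only that special case, not the full mixture-model (CEM-based) version, and the seeding rule and example data are illustrative.

```python
import numpy as np

def online_kmeans(stream, k):
    """Plain online k-means (MacQueen's rule): each new point moves its nearest
    centre by a step of size 1/count.  The article's online CEM generalizes this
    hard-assignment update to full Gaussian mixture components."""
    centres, counts = [], []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(centres) < k:                   # use the first k points as seeds
            centres.append(x.copy())
            counts.append(1)
            continue
        C = np.array(centres)
        j = int(np.argmin(((C - x) ** 2).sum(axis=1)))   # hard assignment (C-step)
        counts[j] += 1
        centres[j] += (x - centres[j]) / counts[j]       # stochastic update (M-step)
    return np.array(centres)

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(5, 1, (300, 2))])
rng.shuffle(data)
print(np.round(online_kmeans(data, k=2), 2))   # roughly the two cluster means
```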

10.
The Buckley–James estimator (BJE) [J. Buckley and I. James, Linear regression with censored data, Biometrika 66 (1979), pp. 429–436] has been extended from right-censored (RC) data to interval-censored (IC) data by Rabinowitz et al. [D. Rabinowitz, A. Tsiatis, and J. Aragon, Regression with interval-censored data, Biometrika 82 (1995), pp. 501–513]. The BJE is defined to be a zero-crossing of a modified score function H(b), i.e. a point at which H(·) changes its sign. We discuss several approaches for finding a BJE with IC data which are extensions of the existing algorithms for RC data. However, these extensions may not be appropriate for some data; in particular, they are not appropriate for a cancer data set that we are analysing. In this note, we present a feasible iterative algorithm for obtaining a BJE and apply the method to our data.
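A minimal sketch of locating a zero-crossing of a score function on a grid; H here is a placeholder, not the actual modified Buckley–James score, and the grid search is only one of several possible strategies.

```python
import numpy as np

def zero_crossing(H, lo, hi, n_grid=200):
    """Return a point in [lo, hi] where the score function H changes sign.
    H is a placeholder for the modified score H(b), which may be step-shaped,
    so the crossing need not be an exact root."""
    grid = np.linspace(lo, hi, n_grid)
    vals = np.array([H(b) for b in grid])
    sign_change = np.where(np.sign(vals[:-1]) * np.sign(vals[1:]) < 0)[0]
    if len(sign_change) == 0:
        return None                           # no sign change on this grid
    i = sign_change[0]
    return 0.5 * (grid[i] + grid[i + 1])      # midpoint of the bracketing cell

# Toy example with a discontinuous, step-like score
H = lambda b: np.sign(b - 1.3) * (abs(b - 1.3) + 0.2)
print(zero_crossing(H, 0.0, 3.0))             # approximately 1.3
```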

11.
Cumulative sum (cusum) methods can be used for monitoring processes and for retrospective (historical) data analysis. Most software only provides the former. The comment by Williamson that retrospective cusum analysis is a neglected area is still true. Though not in vogue, retrospective cusum analysis is useful for investigations such as benchmarking of processes, identifying causes of process decay, selecting reference data sets for typicality studies, and reporting of historical data. Even those texts which cover retrospective analyses usually ignore the question of identifying multiple points of change (breakpoints), and present essentially manual methods for assessing single breakpoints. Most users of statistical methods want software solutions that are easy to use and require little user intervention or interpretation. Direct implementation of the manual method does not give the user a robust solution; the problems are illustrated. Attempts to use monitoring cusums in retrospective analysis can also lead to errors. A practical recursive method is presented for breakpoint identification and significance assessment, which can be automated. Copyright © 2002 John Wiley & Sons, Ltd.
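A minimal sketch of a recursive retrospective cusum analysis, assuming breakpoints are placed at the maximum absolute cusum and assessed by a permutation bootstrap; this illustrates the general idea, not the paper's specific procedure, and all names and thresholds are ours.

```python
import numpy as np

def cusum_breakpoint(y):
    """Retrospective cusum of deviations from the overall mean; the most
    likely single breakpoint is where |S_t| is largest."""
    y = np.asarray(y, dtype=float)
    s = np.cumsum(y - y.mean())
    k = int(np.argmax(np.abs(s[:-1]))) + 1    # keep both segments non-empty
    return k, np.abs(s[k - 1])

def recursive_breakpoints(y, n_boot=500, alpha=0.05, offset=0, rng=None):
    """Recursively split the series at significant cusum breakpoints, with
    significance assessed by a permutation bootstrap of the max |cusum|."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = np.asarray(y, dtype=float)
    if len(y) < 4:
        return []
    k, stat = cusum_breakpoint(y)
    boot = [cusum_breakpoint(rng.permutation(y))[1] for _ in range(n_boot)]
    if stat <= np.quantile(boot, 1 - alpha):
        return []                             # split not significant: stop here
    left = recursive_breakpoints(y[:k], n_boot, alpha, offset, rng)
    right = recursive_breakpoints(y[k:], n_boot, alpha, offset + k, rng)
    return left + [offset + k] + right
```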

12.
The data cloning method is a new computational tool for computing maximum likelihood estimates in complex statistical models such as mixed models. This method is combined with integrated nested Laplace approximation to compute maximum likelihood estimates efficiently via a fast implementation for generalized linear mixed models. The asymptotic behavior of the hybrid data cloning method is discussed. The performance of the proposed method is illustrated through a simulation study and real examples. It is shown that the proposed method performs well and is consistent with the theory. Supplemental materials for this article are available online.

13.
The increasing amount of data stored in the form of dynamic interactions between actors necessitates methodologies to automatically extract relevant information. The interactions can be represented by dynamic networks, for which most existing methods look for clusters of vertices to summarize the data. In this paper, a new framework is proposed to cluster the vertices while detecting change points in the intensities of the interactions. These change points are key to understanding the temporal interactions. The model involves non-homogeneous Poisson point processes with cluster-dependent piecewise constant intensity functions and common discontinuity points. A variational expectation maximization algorithm is derived for inference. We show that the pruned exact linear time (PELT) method, originally developed for change point detection in univariate time series, can be used for the maximization step. This allows both the number of change points and their locations to be estimated. Experiments on artificial and real datasets are carried out, and the proposed approach is compared with related methods.
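PELT itself is easy to sketch on the simpler problem of mean changes in a univariate Gaussian series (the paper applies it to piecewise constant Poisson intensities inside a variational EM). The penalty default below is a BIC-like placeholder, and the simulated data are illustrative.

```python
import numpy as np

def pelt_mean(y, penalty=None):
    """PELT for changes in the mean of a univariate series, with segment cost
    equal to the within-segment sum of squares.  `penalty` is the cost added
    per change point (a BIC-like placeholder by default)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if penalty is None:
        penalty = 2.0 * np.log(n) * y.var()
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):                            # SSE of segment y[i:j] (j exclusive)
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    F = np.full(n + 1, np.inf)
    F[0] = -penalty
    last_cp = np.zeros(n + 1, dtype=int)
    candidates = [0]
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + penalty for s in candidates]
        best = int(np.argmin(vals))
        F[t], last_cp[t] = vals[best], candidates[best]
        # pruning: drop candidates that can never become optimal again
        candidates = [s for s, v in zip(candidates, vals) if v - penalty <= F[t]]
        candidates.append(t)
    change_points, t = [], n
    while last_cp[t] > 0:                      # backtrack the optimal segmentation
        t = last_cp[t]
        change_points.append(t)
    return sorted(change_points)

# Example: mean shifts near indices 100 and 160
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 60), rng.normal(1, 1, 80)])
print(pelt_mean(y))
```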

14.

Outlier detection is an inevitable step in most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high-dimensional data, the outlier will most likely not be outlying along all of its variables, but only along a subset of them. If so, the scientific question of why the case has been flagged as an outlier becomes of interest. In this article, a fast and efficient method is proposed to detect the variables that contribute most to an outlier's outlyingness, thereby helping the analyst understand in which way the case lies out. The approach pursued in this work is to estimate the univariate direction of maximal outlyingness. It is shown that the problem of estimating that direction can be rewritten as the normed solution of a classical least squares regression problem. Identifying the subset of variables contributing most to outlyingness can thus be achieved by estimating the associated least squares problem in a sparse manner. From a practical perspective, sparse partial least squares (SPLS) regression, preferably via the fast sparse NIPALS (SNIPLS) algorithm, is suggested to tackle that problem. The proposed method is demonstrated to perform well on both simulated data and real-life examples.
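The plain (non-sparse) direction of maximal outlyingness can be sketched directly; the snippet below uses the classical mean and covariance and does not perform the sparse partial least squares step described in the article. Function names and simulated data are illustrative.

```python
import numpy as np

def outlyingness_direction(X, x0):
    """Direction of maximal (Mahalanobis-type) outlyingness of the case x0
    relative to the data X, proportional to Sigma^{-1}(x0 - mu); the size of
    |a_j| indicates how much variable j contributes to the outlyingness."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    a = np.linalg.solve(Sigma, x0 - mu)
    return a / np.linalg.norm(a)

# Example: variable 2 drives the outlyingness of x0
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
x0 = np.zeros(5)
x0[2] = 6.0
print(np.round(outlyingness_direction(X, x0), 2))
```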


15.
孙怡帆等 《统计研究》2019,36(3):124-128
Identifying disease genes from a large number of genes is an important high-dimensional statistical problem in the big data setting. Because of the network structure among genes, the identification of disease genes has expanded from identifying individual genes to identifying gene modules. Mining gene modules from a gene network is the so-called community detection (or node clustering) problem. Most community detection methods use only the network structure and ignore the information carried by the nodes themselves. In 2016, Newman and Clauset proposed a statistically principled community detection method that combines the two (referred to here as the NC method). Taking the NC method as a case study, this paper introduces the application of statistical methods to real gene networks and the results obtained, and proposes improvements from a statistical perspective. The analysis of the NC method shows that, for unstructured data such as gene networks, statistical ideas and principles remain central to data analysis, while the corresponding statistical methods need to be adjusted and optimized for the features of the data and the questions of interest.

16.
The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed on separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma–Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of the data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of the Australian unemployment data. Supplementary materials for this article are available online.
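A minimal sketch of the summary statistics idea with a conjugate Gamma–Poisson rate model: each site releases only its sufficient statistics (total count and exposure) and the analyst combines them, so no unit record data leave a site. The site names and numbers are purely illustrative, and the paper's extended multilevel model is not reproduced here.

```python
# Each data custodian releases only the sufficient statistics of its own
# records; the analyst combines them into the posterior of a shared rate.
summaries = [
    {"site": "A", "total_count": 132, "exposure": 410.0},   # illustrative values
    {"site": "B", "total_count": 87,  "exposure": 305.5},
    {"site": "C", "total_count": 54,  "exposure": 150.0},
]
alpha0, beta0 = 1.0, 1.0                      # Gamma prior on the Poisson rate
alpha = alpha0 + sum(s["total_count"] for s in summaries)
beta = beta0 + sum(s["exposure"] for s in summaries)
print(f"posterior rate estimate: {alpha / beta:.3f} (Gamma({alpha}, {beta:.1f}))")
```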

17.
Change point estimation procedures simplify the effort to search for and identify special causes in multivariate statistical process monitoring. After a signal is generated by the simultaneously used control charts or by a single control chart, an add-on change point procedure estimates the time of the change. In this study, multivariate joint change point estimation performance for simultaneous monitoring of both location and dispersion is compared under the assumption that various single charts are used to monitor the process. The change detection performance for several structural changes in the mean vector and covariance matrix is also discussed. It is concluded that the choice of control chart used to obtain a signal may affect the change point detection performance.

18.
Change point detection problems appear in the literature as one of the key issues in testing statistical hypotheses. The main focus of this article is to review recent retrospective change point policies and to propose new relevant procedures. Statements of change point problems have commonly arisen from applied quality control, while various biostatistical and engineering applications motivate extended forms of the change point problem. In this article, we consider parametric and distribution-free generalized change point detection policies, attending to different notions of optimality and robustness of the procedures. We conducted a broad Monte Carlo study to compare various parametric and nonparametric tests, also investigating the sensitivity of the change point detection policies to the assumptions required for correct execution of the procedures. An example based on real biomarker measurements is provided to support our conclusions.

19.
The asymptotic results pertaining to the distribution of the log-likelihood ratio allow for the creation of a confidence region, which is a general extension of the confidence interval. Two- and three-dimensional regions can be displayed visually to describe the plausible region of the parameters of interest simultaneously. While most advanced statistical textbooks on inference discuss these asymptotic confidence regions, they do not explore how to numerically compute these regions for graphical purposes. This article demonstrates the application of a simple trigonometric transformation to compute two- and three-dimensional confidence regions; we transform the Cartesian coordinates of the parameters to create what we call the radial profile log-likelihood. The method is applicable to any distribution with a defined likelihood function, so it is not limited to specific data distributions or model paradigms. We describe the method and the accompanying algorithm, follow with an example, and end with an examination of computation time. Supplementary materials for this article are available online.
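A minimal sketch of the radial idea for a two-dimensional parameter, assuming a user-supplied log-likelihood function: scan angles around the maximum likelihood estimate and, along each ray, find the radius at which the log-likelihood has dropped by half the chi-squared quantile. Function names, tolerances, and the toy example are illustrative.

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

def confidence_region_boundary(loglik, mle, level=0.95, n_angles=180, r_max=10.0):
    """Trace the boundary of a 2-D likelihood-ratio confidence region: for each
    angle, find the radius at which the log-likelihood has dropped by half the
    chi-squared(2) quantile below its maximum."""
    drop = 0.5 * chi2.ppf(level, df=2)
    ll_max = loglik(mle)
    boundary = []
    for theta in np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False):
        direction = np.array([np.cos(theta), np.sin(theta)])
        g = lambda r: (ll_max - loglik(mle + r * direction)) - drop
        try:
            r_star = brentq(g, 1e-8, r_max)    # radius where the drop is reached
            boundary.append(mle + r_star * direction)
        except ValueError:
            pass                               # drop not reached within r_max
    return np.array(boundary)

# Toy usage: region for the mean of a bivariate normal with identity covariance
data = np.random.default_rng(0).standard_normal((50, 2)) + np.array([1.0, 2.0])
loglik = lambda mu: -0.5 * np.sum((data - mu) ** 2)
boundary = confidence_region_boundary(loglik, mle=data.mean(axis=0))
print(boundary.shape)                          # one boundary point per angle
```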

20.
The statistical analysis of change-point detection and estimation has received much attention recently. A change-point is a time point such that observations follow a certain statistical distribution up to that point and a different distribution – commonly of the same functional form but with different parameters – after that point. Multiple change-point problems arise when we have more than one change-point. This paper develops a method for multivariate normally distributed data to detect change-points and estimate within-segment parameters using maximum likelihood estimation.
