Similar Articles
20 similar articles found
1.
The computational demand required to perform inference using Markov chain Monte Carlo methods often obstructs a Bayesian analysis. This may be a result of large datasets, complex dependence structures, or expensive computer models. In these instances, the posterior distribution is replaced by a computationally tractable approximation, and inference is based on this working model. However, the error that is introduced by this practice is not well studied. In this paper, we propose a methodology that allows one to examine the impact of the approximation on statistical inference by quantifying the discrepancy between the intractable and working posterior distributions. This work provides a structure to analyse model approximations with regard to the reliability of inference and computational efficiency. We illustrate our approach through a spatial analysis of yearly total precipitation anomalies where covariance tapering approximations are used to alleviate the computational demand associated with inverting a large, dense covariance matrix.
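Covariance tapering itself is not spelled out in the abstract; the sketch below is only a minimal illustration of the general idea, assuming an exponential covariance and a spherical taper with a hypothetical `taper_range`. The elementwise product forces entries beyond the taper range to exactly zero, so the working covariance can be stored and factorised as a sparse matrix.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.spatial.distance import cdist

def tapered_covariance(coords, sill=1.0, cov_range=2.0, taper_range=1.5):
    """Elementwise product of a dense exponential covariance with a
    compactly supported (spherical) taper; entries beyond taper_range
    become exactly zero, so the result can be stored as a sparse matrix."""
    d = cdist(coords, coords)                     # pairwise distances
    dense_cov = sill * np.exp(-d / cov_range)     # exponential covariance
    frac = np.clip(d / taper_range, 0.0, 1.0)
    taper = (1.0 - 1.5 * frac + 0.5 * frac**3) * (d < taper_range)  # spherical taper
    return csc_matrix(dense_cov * taper)

coords = np.random.default_rng(0).uniform(0, 10, size=(500, 2))
C_tap = tapered_covariance(coords)
print(f"non-zero entries: {C_tap.nnz} of {500 * 500}")
```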

2.
Hidden Markov random field models provide an appealing representation of images and other spatial problems. The drawback is that inference is not straightforward for these models as the normalisation constant for the likelihood is generally intractable except for very small observation sets. Variational methods are an emerging tool for Bayesian inference and they have already been successfully applied in other contexts. Focusing on the particular case of a hidden Potts model with Gaussian noise, we show how variational Bayesian methods can be applied to hidden Markov random field inference. To tackle the obstacle of the intractable normalising constant for the likelihood, we explore alternative estimation approaches for incorporation into the variational Bayes algorithm. We consider a pseudo-likelihood approach as well as the more recent reduced dependence approximation of the normalisation constant. To illustrate the effectiveness of these approaches we present empirical results from the analysis of simulated datasets. We also analyse a real dataset and compare results with those of previous analyses as well as those obtained from the recently developed auxiliary variable MCMC method and the recursive MCMC method. Our results show that the variational Bayesian analyses can be carried out much faster than the MCMC analyses and produce good estimates of model parameters. We also found that the reduced dependence approximation of the normalisation constant outperformed the pseudo-likelihood approximation in our analysis of real and synthetic datasets.
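The pseudo-likelihood approximation mentioned above replaces the intractable joint likelihood of the labels with a product of per-pixel conditional probabilities. The sketch below is a minimal, illustrative implementation for a Potts model on a 2D lattice with 4-neighbourhoods; it is not the paper's variational Bayes algorithm, and `beta` and `k` simply denote the interaction parameter and the number of labels.

```python
import numpy as np

def potts_log_pseudolikelihood(labels, beta, k):
    """Log pseudo-likelihood of a Potts model on a 2D lattice:
    sum over pixels of log p(z_i | z_neighbours), which avoids the
    intractable normalising constant of the full joint distribution."""
    rows, cols = labels.shape
    # counts[lab, i, j] = number of 4-neighbours of pixel (i, j) with label lab
    counts = np.zeros((k, rows, cols))
    for lab in range(k):
        mask = (labels == lab).astype(float)
        padded = np.pad(mask, 1)
        counts[lab] = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                       padded[1:-1, :-2] + padded[1:-1, 2:])
    own = np.take_along_axis(counts, labels[None, :, :], axis=0)[0]  # neighbours matching own label
    log_norm = np.log(np.exp(beta * counts).sum(axis=0))             # per-pixel normaliser
    return float((beta * own - log_norm).sum())

rng = np.random.default_rng(1)
z = rng.integers(0, 3, size=(64, 64))
print(potts_log_pseudolikelihood(z, beta=0.8, k=3))
```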

3.
With the rapid growth of modern technology, many biomedical studies are being conducted to collect massive datasets with volumes of multi‐modality imaging, genetic, neurocognitive and clinical information from increasingly large cohorts. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function and brain‐related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention and treatment of numerous complex brain‐related disorders (e.g., schizophrenia and Alzheimer's disease). However, the development of analytical methods for the joint analysis of both high‐dimensional imaging phenotypes and high‐dimensional genetic data, a big data squared (BD²) problem, presents major computational and theoretical challenges for existing analytical methods. Besides the high‐dimensional nature of BD², various neuroimaging measures often exhibit strong spatial smoothness and dependence and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for imaging genetics, including massive univariate and voxel‐wise approaches, reduced rank regression, mixture models and group sparse multi‐task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research. The Canadian Journal of Statistics 47: 108–131; 2019 © 2019 Statistical Society of Canada

4.
We present a scalable Bayesian modelling approach for identifying brain regions that respond to a certain stimulus and use them to classify subjects. More specifically, we deal with multi‐subject electroencephalography (EEG) data with a binary response distinguishing between alcoholic and control groups. The covariates are matrix‐variate with measurements taken from each subject at different locations across multiple time points. EEG data have a complex structure with both spatial and temporal attributes. We use a divide‐and‐conquer strategy and build separate local models, that is, one model at each time point. We employ Bayesian variable selection approaches using a structured continuous spike‐and‐slab prior to identify the locations that respond to a certain stimulus. We incorporate the spatio‐temporal structure through a Kronecker product of the spatial and temporal correlation matrices. We develop a highly scalable estimation algorithm, using likelihood approximation, to deal with the large number of parameters in the model. Variable selection is done via clustering of the locations based on their duration of activation. We use scoring rules to evaluate the prediction performance. Simulation studies demonstrate the efficiency of our scalable algorithm in terms of estimation and fast computation. We present results using our scalable approach on a case study of multi‐subject EEG data.

5.
Bayesian hierarchical modeling with Gaussian process random effects provides a popular approach for analyzing point-referenced spatial data. For large spatial data sets, however, generic posterior sampling is infeasible due to the extremely high computational burden in decomposing the spatial correlation matrix. In this paper, we propose an efficient algorithm—the adaptive griddy Gibbs (AGG) algorithm—to address the computational issues with large spatial data sets. The proposed algorithm dramatically reduces the computational complexity. We show theoretically that the proposed method approximates the true posterior distribution accurately, and we derive the number of grid points sufficient to achieve a required accuracy. We compare the performance of AGG with that of the state-of-the-art methods in simulation studies. Finally, we apply AGG to spatially indexed data concerning building energy consumption.
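As a point of reference, a basic (non-adaptive) griddy Gibbs update evaluates the unnormalised full conditional on a grid and samples from the resulting discrete approximation; the adaptive grid of the AGG algorithm is not reproduced here. A minimal sketch, with an arbitrary toy full conditional:

```python
import numpy as np

def griddy_gibbs_step(log_full_conditional, grid, rng):
    """One griddy Gibbs update: evaluate the unnormalised log full
    conditional on a fixed grid, normalise, and draw a grid point."""
    log_w = np.array([log_full_conditional(g) for g in grid])
    log_w -= log_w.max()                 # stabilise before exponentiating
    probs = np.exp(log_w)
    probs /= probs.sum()
    return rng.choice(grid, p=probs)

# Toy example: full conditional proportional to a N(2, 0.5^2) density.
rng = np.random.default_rng(0)
grid = np.linspace(-5, 5, 401)
draws = [griddy_gibbs_step(lambda t: -0.5 * ((t - 2.0) / 0.5) ** 2, grid, rng)
         for _ in range(2000)]
print(np.mean(draws), np.std(draws))     # roughly 2.0 and 0.5
```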

6.
Spatial modeling of consumer response data has gained increased interest recently in the marketing literature. In this paper, we extend the (spatial) multi-scale model by incorporating both spatial and temporal dimensions in the dynamic multi-scale spatiotemporal modeling approach. Our empirical application with a US company’s catalog purchase data for the period 1997–2001 reveals a nested geographic market structure that spans geopolitical boundaries such as state borders. This structure identifies spatial clusters of consumers who exhibit similar spatiotemporal behavior, thus pointing to the importance of emergent geographic structure, emergent nested structure and dynamic patterns in multi-resolution methods. The multi-scale model also has better performance in estimation and prediction compared with several spatial and spatiotemporal models and uses a scalable and computationally efficient Markov chain Monte Carlo method that makes it suitable for analyzing large spatiotemporal consumer purchase datasets.
KEYWORDS: Clustering, dynamic linear models, empirical Bayes methods, Markov chain Monte Carlo methods, multi-scale modeling, spatial models

7.
Many problems in the environmental and biological sciences involve the analysis of large quantities of data. Further, the data in these problems are often subject to various types of structure and, in particular, spatial dependence. Traditional model fitting often fails due to the size of the datasets since it is difficult not only to specify but also to compute with the full covariance matrix describing the spatial dependence. We propose a very general type of mixed model that has a random spatial component. Recognizing that spatial covariance matrices often exhibit a large number of zero or near-zero entries, covariance tapering is used to force near-zero entries to zero. Then, taking advantage of the sparse nature of such tapered covariance matrices, backfitting is used to estimate the fixed and random model parameters. The novelty of the paper is the combination of the two techniques, tapering and backfitting, to model and analyze spatial datasets several orders of magnitude larger than those datasets typically analyzed with conventional approaches. Results will be demonstrated with two datasets. The first consists of regional climate model output that is based on an experiment with two regional and two driver models arranged in a two-by-two layout. The second is microarray data used to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures.

8.
Remote sensing of the earth with satellites yields datasets that can be massive in size, nonstationary in space, and non‐Gaussian in distribution. To overcome computational challenges, we use the reduced‐rank spatial random effects (SRE) model in a statistical analysis of cloud‐mask data from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instrument on board NASA's Terra satellite. Parameterisations of cloud processes are the biggest source of uncertainty and sensitivity in different climate models’ future projections of Earth's climate. An accurate quantification of the spatial distribution of clouds, as well as a rigorously estimated pixel‐scale clear‐sky‐probability process, is needed to establish reliable estimates of cloud‐distributional changes and trends caused by climate change. Here we give a hierarchical spatial‐statistical modelling approach for a very large spatial dataset of 2.75 million pixels, corresponding to a granule of MODIS cloud‐mask data, and we use spatial change‐of‐support relationships to estimate cloud fraction at coarser resolutions. Our model is non‐Gaussian; it postulates a hidden process for the clear‐sky probability that makes use of the SRE model, EM‐estimation, and optimal (empirical Bayes) spatial prediction of the clear‐sky‐probability process. Measures of prediction uncertainty are also given.
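The computational payoff of a reduced-rank model of this kind comes from the fact that a covariance of the form Σ = SKSᵀ + σ²I can be handled with the Sherman–Morrison–Woodbury identity, so only r×r matrices are ever inverted. A minimal numerical check of that identity (the dimensions and the diagonal error covariance are illustrative assumptions, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 2000, 50                        # n observations, r << n spatial basis functions
S = rng.normal(size=(n, r))            # basis-function matrix
K = np.eye(r) + 0.1 * np.ones((r, r))  # r x r random-effects covariance (toy choice)
sigma2 = 0.5
b = rng.normal(size=n)

# Sherman-Morrison-Woodbury: (sigma2*I + S K S^T)^{-1} b using only r x r solves.
inner = np.linalg.inv(np.linalg.inv(K) + S.T @ S / sigma2)   # r x r
x_fast = b / sigma2 - S @ (inner @ (S.T @ b)) / sigma2**2

# Direct n x n solve for verification (infeasible for millions of pixels).
Sigma = sigma2 * np.eye(n) + S @ K @ S.T
x_slow = np.linalg.solve(Sigma, b)
print(np.max(np.abs(x_fast - x_slow)))   # close to zero
```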

9.
In spatial statistics, models are often constructed based on some common, but possibly restrictive assumptions for the underlying spatial process, including Gaussianity as well as stationarity and isotropy. However, these assumptions are frequently violated in applied problems. In order to simultaneously handle skewness and non-homogeneity (i.e., non-stationarity and anisotropy), we develop the fixed rank kriging model through the use of a skew-normal distribution for its non-spatial latent variables. Our approach to spatial modeling is easy to implement and also provides a great flexibility in adjusting to skewed and large datasets with heterogeneous correlation structures. We adopt a Bayesian framework for our analysis, and describe a simple MCMC algorithm for sampling from the posterior distribution of the model parameters and performing spatial prediction. Through a simulation study, we demonstrate that the proposed model could detect departures from normality and, for illustration, we analyze a synthetic dataset of CO₂ measurements. Finally, to deal with multivariate spatial data showing some degree of skewness, a multivariate extension of the model is also provided.

10.
Spatial outliers are spatially referenced objects whose non-spatial attribute values are significantly different from the corresponding values in their spatial neighborhoods. In other words, a spatial outlier is a local instability or an extreme observation that deviates significantly within its spatial neighborhood, but possibly not within the entire dataset. In this article, we propose a novel spatial outlier detection algorithm, the location quotient (LQ), for multiple-attribute spatial datasets, and compare its performance with the well-known mean and median algorithms for such datasets. In particular, we apply the mean, median, and LQ algorithms to a real dataset and to simulated spatial datasets of 13 different sizes to compare their performances. In addition, we calculate area under the curve values in all cases, which show that the proposed algorithm is more powerful than the mean and median algorithms in almost all the considered cases, and we also plot receiver operating characteristic curves in some cases.
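The location quotient algorithm proposed in the paper is not reproduced here; the sketch below shows only the neighbourhood-mean baseline it is compared against, in a single-attribute form, assuming k-nearest-neighbour neighbourhoods and a z-score threshold (both illustrative choices).

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_algorithm_outliers(coords, attr, k=8, threshold=3.0):
    """Neighbourhood-mean spatial outlier detection: compare each site's
    attribute with the mean of its k nearest neighbours and flag sites
    whose standardised difference exceeds the threshold."""
    tree = cKDTree(coords)
    _, idx = tree.query(coords, k=k + 1)        # first neighbour is the point itself
    neigh_mean = attr[idx[:, 1:]].mean(axis=1)
    diff = attr - neigh_mean
    z = (diff - diff.mean()) / diff.std()
    return np.where(np.abs(z) > threshold)[0]

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(1000, 2))
attr = np.sin(coords[:, 0] / 10) + 0.1 * rng.normal(size=1000)   # smooth spatial trend + noise
attr[[10, 500]] += 5.0                          # implant two spatial outliers
print(mean_algorithm_outliers(coords, attr))    # typically recovers indices 10 and 500
```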

11.
Models for geostatistical data introduce spatial dependence in the covariance matrix of location-specific random effects. This is usually defined to be a parametric function of the distances between locations. Bayesian formulations of such models overcome asymptotic inference and estimation problems involved in maximum likelihood-based approaches and can be fitted using Markov chain Monte Carlo (MCMC) simulation. The MCMC implementation, however, requires repeated inversions of the covariance matrix, which makes the problem computationally intensive, especially for a large number of locations. In the present work, we propose to convert the spatial covariance matrix to a sparse matrix and compare a number of numerical algorithms, especially suited to the MCMC framework, for accelerating large matrix inversion. The algorithms are assessed empirically on simulated datasets of different size and sparsity. We conclude that the band solver applied after ordering the distance matrix substantially reduces the computational time for inverting covariance matrices.
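A minimal sketch of the underlying idea: force near-zero covariance entries to exactly zero and reuse a sparse factorisation for the repeated solves inside MCMC. The thresholding rule and the use of a sparse LU (rather than the band solver with reordering that the paper finds fastest) are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
coords = rng.uniform(0, 50, size=(1500, 2))
d = cdist(coords, coords)
cov = np.exp(-d / 2.0)                     # exponential covariance
cov[cov < 1e-3] = 0.0                      # force near-zero entries to exactly zero

cov_sparse = csc_matrix(cov)
lu = splu(cov_sparse)                      # sparse factorisation, reusable across MCMC iterations
b = rng.normal(size=1500)
x = lu.solve(b)

print(f"density of the sparsified covariance: {cov_sparse.nnz / 1500**2:.3f}")
print(np.max(np.abs(cov @ x - b)))         # residual close to zero
```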

12.
Classical multivariate methods are often based on the sample covariance matrix, which is very sensitive to outlying observations. One alternative to the covariance matrix is the affine equivariant rank covariance matrix (RCM) that has been studied in Visuri et al. [2003. Affine equivariant multivariate rank methods. J. Statist. Plann. Inference 114, 161–185]. In this article we assume that the covariance matrix is partially known and study how to estimate the corresponding RCM. We use the properties that the RCM is affine equivariant and that the RCM is proportional to the inverse of the regular covariance matrix, and hence reduce the problem of estimating the original RCM to estimating marginal rank covariance matrices. This is a great computational advantage when the dimension of the original data vector is large.

13.
Most applications in spatial statistics involve modeling of complex spatial–temporal dependency structures, and many of the problems of space and time modeling can be overcome by using separable processes. This subclass of spatial–temporal processes has several advantages, including rapid fitting and simple extensions of many techniques developed and successfully used in time series and classical geostatistics. In particular, a major advantage of these processes is that the covariance matrix for a realization can be expressed as the Kronecker product of two smaller matrices that arise separately from the temporal and purely spatial processes, and hence its determinant and inverse are easily computed. However, these separable models are not always realistic, and there are no formal tests for separability of general spatial–temporal processes. We present here a formal method to test for separability. Our approach can also be used to test for lack of stationarity of the process. The beauty of our approach is that by using spectral methods the mechanics of the test can be reduced to a simple two-factor analysis of variance (ANOVA) procedure. The approach we propose is based on only one realization of the spatial–temporal process. We apply the statistical methods proposed here to test for separability and stationarity of spatial–temporal ozone fields using data provided by the US Environmental Protection Agency (EPA).
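The separability advantage rests on two Kronecker identities: for an m×m spatial factor S and an n×n temporal factor T, det(S ⊗ T) = det(S)ⁿ det(T)ᵐ and (S ⊗ T)⁻¹ = S⁻¹ ⊗ T⁻¹. A small numerical check with toy correlation matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                                          # m spatial sites, n time points

def random_correlation(size):
    a = rng.normal(size=(size, size))
    c = a @ a.T + size * np.eye(size)
    d = np.sqrt(np.diag(c))
    return c / np.outer(d, d)

S = random_correlation(m)                            # spatial correlation
T = random_correlation(n)                            # temporal correlation
C = np.kron(S, T)                                    # separable spatio-temporal covariance

# Determinant and inverse follow from the small factors alone.
det_identity = np.linalg.det(S) ** n * np.linalg.det(T) ** m
inv_identity = np.kron(np.linalg.inv(S), np.linalg.inv(T))

print(np.isclose(np.linalg.det(C), det_identity))    # True
print(np.allclose(np.linalg.inv(C), inv_identity))   # True
```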

14.
15.
Cui, Ruifei; Groot, Perry; Heskes, Tom. Statistics and Computing, 2019, 29(2): 311–333.

We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the ‘Rank PC’ algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. When the data are missing completely at random (MCAR), we provide an error bound on the accuracy of ‘Rank PC’ and show its high-dimensional consistency. However, when the data are missing at random (MAR), ‘Rank PC’ fails dramatically. Therefore, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data that still works correctly under MAR. These samples are translated into an average correlation matrix and an effective sample size, resulting in the ‘Copula PC’ algorithm for incomplete data. A simulation study shows that: (1) ‘Copula PC’ estimates a more accurate correlation matrix and causal structure than ‘Rank PC’ under MCAR and, even more so, under MAR; and (2) the use of the effective sample size significantly improves the performance of ‘Rank PC’ and ‘Copula PC’. We illustrate our methods on two real-world datasets: riboflavin production data and chronic fatigue syndrome data.
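A minimal sketch of the correlation step described above: Spearman rank correlations computed from pairwise-complete observations and mapped to the Pearson scale with the usual Gaussian-copula transform 2 sin(πρ_S/6). The effective sample size shown (the mean pairwise-complete count) is an illustrative assumption rather than the paper's definition, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def rank_correlation_with_ess(df):
    """Pairwise-complete Spearman correlations mapped to the Pearson scale via
    the nonparanormal transform, plus a crude effective sample size."""
    rho_s = df.corr(method="spearman")                  # pairwise-complete by default
    rho = 2.0 * np.sin(np.pi / 6.0 * rho_s)             # 2*sin(pi*rho_S/6)
    complete = df.notna().astype(int)
    pair_counts = complete.T @ complete                 # jointly observed counts per pair
    ess = pair_counts.values[np.triu_indices(df.shape[1], k=1)].mean()  # illustrative choice
    return rho, ess

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.multivariate_normal([0, 0, 0],
                                         [[1, .5, .2], [.5, 1, .4], [.2, .4, 1]], size=500),
                 columns=list("abc"))
X = X.mask(rng.random(X.shape) < 0.1)                   # scatter 10% missing values (MCAR)
rho, ess = rank_correlation_with_ess(X)
print(rho.round(2))
print("effective sample size:", ess)
```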


16.
We consider estimation of unknown parameters of a Burr XII distribution based on progressively Type I hybrid censored data. The maximum likelihood estimates are obtained using an expectation maximization algorithm. Asymptotic interval estimates are constructed from the Fisher information matrix. We obtain Bayes estimates under the squared error loss function using the Lindley method and Metropolis–Hastings algorithm. The predictive estimates of censored observations are obtained and the corresponding prediction intervals are also constructed. We compare the performance of the different methods using simulations. Two real datasets have been analyzed for illustrative purposes.
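The EM algorithm for progressively Type I hybrid censored data is not reproduced here; as a much simpler point of reference, the sketch below fits a Burr XII distribution to complete (uncensored) data by maximum likelihood using scipy, with location and scale held fixed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
c_true, k_true = 2.0, 3.0
data = stats.burr12.rvs(c_true, k_true, size=1000, random_state=rng)

# Complete-data MLE with location fixed at 0 and scale fixed at 1;
# the paper's EM algorithm for censored data would replace this step.
c_hat, k_hat, loc, scale = stats.burr12.fit(data, floc=0, fscale=1)
print("estimated shape parameters:", c_hat, k_hat)
```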

17.
Recently, the methods used to estimate monotonic regression (MR) models have been substantially improved, and some algorithms can now produce high-accuracy monotonic fits to multivariate datasets containing over a million observations. Nevertheless, the computational burden can be prohibitively large for resampling techniques in which numerous datasets are processed independently of each other. Here, we present efficient algorithms for estimation of confidence limits in large-scale settings that take into account the similarity of the bootstrapped or jackknifed datasets to which MR models are fitted. In addition, we introduce modifications that substantially improve the accuracy of MR solutions for binary response variables. The performance of our algorithms is illustrated using data on death in coronary heart disease for a large population. This example also illustrates that MR can be a valuable complement to logistic regression.
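The paper's accelerated resampling algorithms are not shown; the sketch below is the naive approach they improve on, refitting an isotonic (monotonic) regression to each bootstrap resample and reading off percentile confidence limits, using scikit-learn and a toy binary response.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
y = (rng.random(n) < 1 / (1 + np.exp(-(x - 5)))).astype(float)   # binary response

grid = np.linspace(0, 10, 101)
boot_fits = []
for _ in range(200):                                  # naive bootstrap loop
    idx = rng.integers(0, n, n)
    iso = IsotonicRegression(out_of_bounds="clip").fit(x[idx], y[idx])
    boot_fits.append(iso.predict(grid))

lower, upper = np.percentile(boot_fits, [2.5, 97.5], axis=0)
point = IsotonicRegression(out_of_bounds="clip").fit(x, y).predict(grid)
print(point[::20])
print(lower[::20])
print(upper[::20])
```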

18.
High-dimensional datasets have exploded into many fields of research, challenging our interpretation of the classic dimension reduction technique, Principal Component Analysis (PCA). Recently proposed Sparse PCA methods offer useful insight into understanding complex data structures. This article compares three Sparse PCA methods through extensive simulations, with the aim of providing guidelines as to which method to choose under a variety of data structures, as dictated by the variance-covariance matrix. A real gene expression dataset is used to illustrate an application of Sparse PCA in practice and show how to link simulation results with real-world problems.
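Not all of the Sparse PCA methods compared in the article are available in scikit-learn; the sketch below simply contrasts ordinary PCA loadings with one sparse variant on a toy dataset whose covariance has two disjoint blocks, to illustrate the kind of structure sparse loadings can recover. The block sizes and penalty `alpha` are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
n, p = 300, 50
# Two latent factors, each driving a disjoint block of 5 variables.
scores = rng.normal(size=(n, 2))
loadings = np.zeros((2, p))
loadings[0, :5] = 1.0
loadings[1, 5:10] = 1.0
X = scores @ loadings + 0.5 * rng.normal(size=(n, p))

pca = PCA(n_components=2).fit(X)
spca = SparsePCA(n_components=2, alpha=2.0, random_state=0).fit(X)

print("non-zero loadings, PCA:      ", np.count_nonzero(np.abs(pca.components_) > 1e-8))
print("non-zero loadings, SparsePCA:", np.count_nonzero(spca.components_))
```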

19.
The big data era demands new statistical analysis paradigms, since traditional methods often break down when datasets are too large to fit on a single desktop computer. Divide and Recombine (D&R) is becoming a popular approach for big data analysis, where results are combined over subanalyses performed in separate data subsets. In this article, we consider situations where unit record data cannot be made available by data custodians due to privacy concerns, and explore the concept of statistical sufficiency and summary statistics for model fitting. The resulting approach represents a type of D&R strategy, which we refer to as summary statistics D&R, as opposed to the standard approach, which we refer to as horizontal D&R. We demonstrate the concept via an extended Gamma–Poisson model, where summary statistics are extracted from different databases and incorporated directly into the fitting algorithm without having to combine unit record data. By exploiting the natural hierarchy of data, our approach has major benefits in terms of privacy protection. Incorporating the proposed modelling framework into data extraction tools such as TableBuilder by the Australian Bureau of Statistics allows for potential analysis at a finer geographical level, which we illustrate with a multilevel analysis of the Australian unemployment data. Supplementary materials for this article are available online.
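A minimal conjugate Gamma–Poisson illustration of the summary statistics D&R idea (the extended model in the article is richer than this): each data custodian releases only the count total and the number of records, and by sufficiency the combined posterior is identical to the one computed from the pooled unit record data.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 3.2
subsets = [rng.poisson(true_rate, size=n) for n in (400, 250, 700)]   # three data custodians

# Each custodian releases only two summary statistics: (sum of counts, number of records).
summaries = [(y.sum(), y.size) for y in subsets]

# Gamma(a, b) prior on the Poisson rate; conjugacy makes the summaries sufficient.
a, b = 1.0, 1.0
a_post = a + sum(s for s, _ in summaries)
b_post = b + sum(n for _, n in summaries)
print("posterior mean from summaries:", a_post / b_post)     # close to 3.2

# Identical to the posterior from the pooled unit record data.
pooled = np.concatenate(subsets)
print("posterior mean from pooled data:", (a + pooled.sum()) / (b + pooled.size))
```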

20.
The main focus of our paper is to compare the performance of different model selection criteria used for multivariate reduced rank time series. We consider one of the most commonly used reduced rank models, that is, the reduced rank vector autoregression (RRVAR (p, r)) introduced by Velu et al. [Reduced rank models for multiple time series. Biometrika. 1986;73(1):105–118]. In our study, the most popular model selection criteria are included. The criteria are divided into two groups: simultaneous selection criteria and two-step selection criteria. Methods from the former group select both an autoregressive order p and a rank r simultaneously, while in the case of two-step criteria, first an optimal order p is chosen (using model selection criteria intended for the unrestricted VAR model) and then an optimal rank r of the coefficient matrices is selected (e.g. by means of sequential testing). The considered model selection criteria include well-known information criteria (such as the Akaike information criterion, Schwarz criterion, Hannan–Quinn criterion, etc.) as well as widely used sequential tests (e.g. the Bartlett test) and the bootstrap method. An extensive simulation study is carried out in order to investigate the efficiency of all model selection criteria included in our study. The analysis takes into account 34 methods: 6 simultaneous methods and 28 two-step approaches. In order to carefully analyse how different factors affect the performance of model selection criteria, we consider over 150 simulation settings. In particular, we investigate the influence of the following factors: time series dimension, different covariance structure, different level of correlation among components and different level of noise (variance). Moreover, we analyse the prediction accuracy concerned with the application of the RRVAR model and compare it with results obtained for the unrestricted vector autoregression. In this paper, we also present a real data application of model selection criteria for the RRVAR model using Polish macroeconomic time series data observed over the period 1997–2007.
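A sketch of the first stage of the two-step criteria only: choosing the autoregressive order p for the unrestricted VAR by an information criterion, here via statsmodels on a simulated bivariate VAR(2). The subsequent rank selection and the RRVAR fit itself are not shown, and the simulated coefficients are arbitrary.

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
# Simulate a stable bivariate VAR(2) process.
T, k = 400, 2
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.2, 0.0], [0.1, 0.2]])
y = np.zeros((T, k))
for t in range(2, T):
    y[t] = A1 @ y[t - 1] + A2 @ y[t - 2] + rng.normal(scale=0.5, size=k)

# Step one of the two-step criteria: pick p for the unrestricted VAR by AIC.
model = VAR(y)
ics = {p: model.fit(p).aic for p in range(1, 7)}
p_hat = min(ics, key=ics.get)
print("selected order:", p_hat)
print({p: round(v, 3) for p, v in ics.items()})
```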
