Recent developments in engineering techniques for spatial data collection such as geographic information systems have resulted in an increasing need for methods to analyze large spatial datasets. These sorts of datasets can be found in various fields of the natural and social sciences. However, model fitting and spatial prediction using these large spatial datasets are impractically time-consuming, because of the necessary matrix inversions. Various methods have been developed to deal with this problem, including a reduced rank approach and a sparse matrix approximation. In this article, we propose a modification to an existing reduced rank approach to capture both the large- and small-scale spatial variations effectively. We have used simulated examples and an empirical data analysis to demonstrate that our proposed approach consistently performs well when compared with other methods. In particular, the performance of our new method does not depend on the dependence properties of the spatial covariance functions.

In this paper, we prove a novel result about the asymptotic distribution of a class of rank statistics that can be used when the number of replications is limited while the number of treatments goes to infinity (large k, small n case). The results can be applied to, e.g., data from agricultural screening trials, where the number of factor levels is usually large but there are only a few replications per factor level.

Many problems in the environmental and biological sciences involve the analysis of large quantities of data. Further, the data in these problems are often subject to various types of structure and, in particular, spatial dependence. Traditional model fitting often fails due to the size of the datasets since it is difficult to not only specify but also to compute with the full covariance matrix describing the spatial dependence. We propose a very general type of mixed model that has a random spatial component. Recognizing that spatial covariance matrices often exhibit a large number of zero or near-zero entries, covariance tapering is used to force near-zero entries to zero. Then, taking advantage of the sparse nature of such tapered covariance matrices, backfitting is used to estimate the fixed and random model parameters. The novelty of the paper is the combination of the two techniques, tapering and backfitting, to model and analyze spatial datasets several orders of magnitude larger than those datasets typically analyzed with conventional approaches. Results will be demonstrated with two datasets. The first consists of regional climate model output that is based on an experiment with two regional and two driver models arranged in a two-by-two layout. The second is microarray data used to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures.
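As a rough illustration of the covariance tapering idea mentioned in this abstract, the sketch below multiplies a dense covariance matrix elementwise by a compactly supported taper so that entries beyond a cutoff distance become exactly zero. The exponential covariance, the spherical taper, the taper range `theta`, and all function names are assumptions chosen for illustration; they are not taken from the paper, and the backfitting step is not shown.

```python
import numpy as np

def exponential_cov(d, sill=1.0, range_=0.3):
    """Exponential covariance as a function of distance (assumed model)."""
    return sill * np.exp(-d / range_)

def spherical_taper(d, theta):
    """Spherical correlation used as a taper: exactly zero for d >= theta."""
    r = np.clip(d / theta, 0.0, 1.0)
    return (1.0 - r) ** 2 * (1.0 + r / 2.0)

# Hypothetical sampling locations in the unit square.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(200, 2))

# Pairwise Euclidean distances between all locations.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Elementwise (Schur) product of the covariance and the taper.
full_cov = exponential_cov(dist)
tapered_cov = full_cov * spherical_taper(dist, theta=0.15)

# Fraction of entries forced to exactly zero by the taper.
sparsity = np.mean(tapered_cov == 0.0)
print(f"fraction of zero entries: {sparsity:.2f}")
```

In practice the tapered matrix would be stored in a sparse format so that linear solves exploit the zeros; a dense array is used here only to keep the sketch dependency-free.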

Statistics and Computing - This paper introduces a framework for speeding up Bayesian inference conducted in presence of large datasets. We design a Markov chain whose transition kernel uses an...

In what follows, we introduce two Bayesian models for feature selection in high-dimensional data, specifically designed for the purpose of classification. We use two approaches to the problem: one which discards the components which have “almost constant” values (Model 1) and another which retains the components for which variations between the groups are larger than those within the groups (Model 2). We assume that p ≫ n, i.e. the number of components p is much larger than the number of samples n, and that only a few of those p components are useful for subsequent classification. We show that particular cases of the above two models recover familiar variance or ANOVA-based component selection. When one has only two classes and features are a priori independent, Model 2 reduces to the Feature Annealed Independence Rule (FAIR) introduced by Fan and Fan (2008) and can be viewed as a natural generalization of FAIR to the case of L > 2 classes. The performance of the methodology is studied via simulations and using a biological dataset of animal communication signals comprising 43 groups of electric signals recorded from tropical South American electric knifefishes.

We derive the explicit form for the asymptotic posterior distribution of the balanced nested multi-way variance components model with the assumption that the number of the main factor levels tends to infinity while the number of any specific effect factor levels remains fixed. Under the multi-way model, we also study two different parameterizations, called the standard and the centering, and the relationship between certain quadratic forms of random effects and the variance component parameters. The asymptotic results are illustrated by a three-way model and by a simulation study under a two-way case.

AStA Advances in Statistical Analysis - This paper deals with the estimation of kurtosis on large datasets. It aims at overcoming two frequent limitations in applications: first, Pearson's...

Inspired by the ideas of column and row juxtaposition in Liu and Lin (2009) and level transformation in Yamada and Lin (1999), this paper presents a new method for constructing optimal supersaturated designs (SSDs). This method provides a convenient way to construct mixed-level designs with relatively large numbers of levels, avoiding the blind search and numerous calculations by computers. The goodness of the resulting SSDs is judged by the χ² (Yamada and Lin, 1999 and Yamada and Matsui, 2002) and J₂ (Xu, 2002) criteria. Some nice properties of the new designs are also provided.

We consider the problem of full information maximum likelihood (FIML) estimation in factor analysis when a majority of the data values are missing. The expectation–maximization (EM) algorithm is often used to find the FIML estimates, in which the missing values on manifest variables are included in complete data. However, the ordinary EM algorithm has an extremely high computational cost. In this paper, we propose a new algorithm that is based on the EM algorithm but that efficiently computes the FIML estimates. A significant improvement in the computational speed is realized by not treating the missing values on manifest variables as a part of complete data. When there are many missing data values, it is not clear if the FIML procedure can achieve good estimation accuracy. In order to investigate this, we conduct Monte Carlo simulations under a wide variety of sample sizes.


In a common Bayesian hierarchical model, high-dimensional observed data depend on high-dimensional latent variables that, in turn, depend on relatively few hyperparameters. When the full conditional distribution over latent variables has a known form, general MCMC sampling need only be performed on the low-dimensional marginal posterior distribution over hyperparameters. This improves on popular Gibbs sampling that computes over the full space. Sampling the marginal posterior over hyperparameters exhibits good scaling of compute cost with data size, particularly when that distribution depends on a low-dimensional sufficient statistic.

We propose different multivariate nonparametric tests for factorial designs and derive their asymptotic distribution for the situation where the number of replications is limited, whereas the number of treatments goes to infinity (large a, small n case). The tests are based on separate rankings for the different variables, and they are therefore invariant under separate monotone transformations of the individual variables.

The purpose of this paper is to derive systematically the general upper bound for the number of blocks having a given number of treatments in common with a given block of certain incomplete block designs. The approach adopted here is based on the spectral decomposition of N'N for the incidence matrix N of a design, where N' is the transpose of the matrix N. This approach will lead us to upper bounds for incomplete block designs, in particular for a large number of partially balanced incomplete block (PBIB) designs, which are not covered by the standard approach (Shah (1964, 1966), Kapadia (1966)) of using well-known relations between blocks of the designs and their association schemes. Several results concerning the block structure of block designs are also derived from the main theorem. Finally, further generalizations of the main theorem are discussed with some illustrations.

M. Akbari 《Statistics》2013,47(3):633-640
In this paper, using the completeness properties of the sequence of functions {h_n(x) = (−log x)^n, 0 < x < 1, n ≥ 1}, some characterization results are established. The results are based on the number of observations near the k-records. It is shown that the equality of the moments of the appropriate subsequence of the number of observations near the upper and lower k-records is a characteristic property of symmetric distributions. Since ordinary record values are contained in the k-records, the results hold for usual records as a particular case.


Life tables used in life insurance are often calibrated to show the survival function of the age-at-death distribution at exact integer ages. Actuaries usually make fractional age assumptions (FAAs) when they have to value payments that are not restricted to integer ages. Traditional FAAs have the advantage of simplicity but cannot guarantee to capture precisely the real trends of the survival functions and sometimes even result in a non-intuitive overall shape of the force of mortality. In fact, an FAA is an interpolation between integer age values which are accepted as given. In this article, we introduce the Kriging model, which is widely used as a metamodel for expensive simulations, to fit the survival function at integer ages, and furthermore use the precisely constructed survival function to build the force of mortality and the life expectancy. The experimental results obtained from a simulated life table (Makehamized life table) and two “real” life tables (the Chinese and US life tables) show that these actuarial quantities (survival function, force of mortality, and life expectancy) produced by the Kriging model are much more accurate than those produced by commonly used FAAs: the uniform distribution of death (UDD) assumption, the constant force assumption, and the Balducci assumption.
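The three classical FAAs named in this abstract can each be read as an interpolation rule for the survival function between two consecutive integer ages. The sketch below uses the standard textbook forms: linear interpolation for UDD, geometric (log-linear) interpolation for constant force, and harmonic interpolation for Balducci. The function names and the sample survival values are hypothetical, and the Kriging fit proposed in the paper is not shown.

```python
import math

def survival_udd(s_x, s_x1, t):
    """UDD: S(x+t) is linear in t between S(x) and S(x+1)."""
    return (1.0 - t) * s_x + t * s_x1

def survival_constant_force(s_x, s_x1, t):
    """Constant force: log S(x+t) is linear in t."""
    return s_x ** (1.0 - t) * s_x1 ** t

def survival_balducci(s_x, s_x1, t):
    """Balducci: 1 / S(x+t) is linear in t."""
    return 1.0 / ((1.0 - t) / s_x + t / s_x1)

# Hypothetical survival probabilities at two consecutive integer ages.
s_x, s_x1 = 0.90, 0.80
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(
        t,
        survival_udd(s_x, s_x1, t),
        survival_constant_force(s_x, s_x1, t),
        survival_balducci(s_x, s_x1, t),
    )
```

All three rules agree at t = 0 and t = 1 and differ only in between, where the arithmetic, geometric, and harmonic means give UDD ≥ constant force ≥ Balducci.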

In this paper, we study the asymptotic properties of the adaptive Lasso estimators in high-dimensional generalized linear models. The consistency of the adaptive Lasso estimator is obtained. We show that, if a reasonable initial estimator is available, under appropriate conditions, the adaptive Lasso correctly selects covariates with nonzero coefficients with probability converging to one, and that the estimators of nonzero coefficients have the same asymptotic distribution they would have if the zero coefficients were known in advance. Thus, the adaptive Lasso has an oracle property. The results are examined by some simulations and a real example.

Recently, several methodologies to perform geostatistical analysis of functional data have been proposed. All of them assume that the spatial functional process considered is stationary. However, in practice, we often have nonstationary functional data because there exists an explicit spatial trend in the mean. Here, we propose a methodology to extend kriging predictors for functional data to the case where the mean function is not constant through the region of interest. We consider an approach based on the classical residual kriging method used in univariate geostatistics. We propose a three-step procedure. Initially, a functional regression model is used to detrend the mean. Then we apply kriging methods for functional data to the regression residuals to predict a residual curve at a non-data location. Finally, the prediction curve is obtained as the sum of the trend and the residual prediction. We apply the methodology to salinity data corresponding to 21 salinity curves recorded at the Ciénaga Grande de Santa Marta estuary, located on the Caribbean coast of Colombia. A cross-validation analysis was carried out to track the performance of the proposed methodology.
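The three steps described in this abstract can be sketched in the scalar (univariate) setting that the authors cite as their starting point. The code below is only an analogue under strong assumptions: observations are scalars rather than curves, the trend is linear in the coordinates, the residual covariance is exponential with known parameters, and simple kriging (zero-mean residuals) is used. The function names and parameter values are hypothetical and are not the paper's functional method.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_=0.5):
    """Exponential covariance with assumed, known parameters."""
    return sill * np.exp(-d / range_)

def residual_kriging(coords, values, new_coord, sill=1.0, range_=0.5):
    """Predict at new_coord as trend plus kriged residual (scalar sketch)."""
    # Step 1: detrend with an ordinary least squares fit that is linear in the coordinates.
    X = np.column_stack([np.ones(len(coords)), coords])
    beta, *_ = np.linalg.lstsq(X, values, rcond=None)
    residuals = values - X @ beta

    # Step 2: simple kriging of the residuals at the new location.
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d_new = np.linalg.norm(coords - new_coord, axis=-1)
    weights = np.linalg.solve(exp_cov(d_obs, sill, range_),
                              exp_cov(d_new, sill, range_))
    resid_pred = weights @ residuals

    # Step 3: add the trend back to the predicted residual.
    trend_pred = np.concatenate([[1.0], new_coord]) @ beta
    return trend_pred + resid_pred

# Hypothetical data: a linear trend plus noise at 30 random locations.
rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1.0, size=(30, 2))
values = 2.0 + 3.0 * coords[:, 0] - coords[:, 1] + 0.1 * rng.standard_normal(30)
prediction = residual_kriging(coords, values, np.array([0.5, 0.5]))
print(prediction)
```

Because no nugget term is included, the predictor interpolates exactly: predicting at an observed location returns the observed value.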

