Similar Literature
20 similar records found.
1.
Concerning the task of integrating census and survey data from different sources as it is carried out by supranational statistical agencies, a formal metadata approach is investigated which supports data integration and table processing simultaneously. To this end, a metadata model is devised such that statistical query processing is accomplished by means of symbolic reasoning on machine-readable, operative metadata. As in databases, statistical queries are stated as formal expressions specifying declaratively what the intended output is; the operations necessary to retrieve appropriate available source data and to aggregate source data into the requested macrodata are derived mechanically. Using simple mathematics, this paper focuses particularly on the metadata model devised to harmonize semantically related data sources as well as the table model providing the principal data structure of the proposed system. Only an outline of the general design of a statistical information system based on the proposed metadata model is given and the state of development is summarized briefly.

2.
Empirical estimates of source statistical economic data such as trade flows, greenhouse gas emissions, or employment figures are always subject to uncertainty (stemming from measurement errors or confidentiality) but information concerning that uncertainty is often missing. This article uses concepts from Bayesian inference and the maximum entropy principle to estimate the prior probability distribution, uncertainty, and correlations of source data when such information is not explicitly provided. In the absence of additional information, an isolated datum is described by a truncated Gaussian distribution, and if an uncertainty estimate is missing, its prior equals the best guess. When the sum of a set of disaggregate data is constrained to match an aggregate datum, it is possible to determine the prior correlations among disaggregate data. If aggregate uncertainty is missing, all prior correlations are positive. If aggregate uncertainty is available, prior correlations can be either all positive, all negative, or a mix of both. An empirical example is presented, which reports relative uncertainties and correlation priors for the County Business Patterns database. In this example, relative uncertainties range from 1% to 80% and 20% of data pairs exhibit correlations below −0.9 or above 0.9. Supplementary materials for this article are available online.
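A minimal numerical sketch of how an aggregation constraint can induce prior correlations among disaggregate data. It uses plain Gaussian conditioning on a noisy sum rather than the authors' maximum-entropy construction, and the best guesses, uncertainties, and aggregate uncertainty below are made-up placeholders. In this simplified setting the induced correlations come out negative, one of the regimes the abstract describes for the case where aggregate uncertainty is available.

```python
import numpy as np

# Made-up disaggregate best guesses, their uncertainties, and the uncertainty
# of the aggregate datum constraining their sum (all values are illustrative).
mu = np.array([10.0, 20.0, 30.0])
sigma = np.array([1.0, 2.0, 3.0])
sigma_agg = 1.5

D = np.diag(sigma ** 2)                  # independent prior covariance
one = np.ones((3, 1))

# Condition on the aggregate a = 1'x + e with e ~ N(0, sigma_agg^2):
# conditional covariance = D - D 1 (1' D 1 + sigma_agg^2)^(-1) 1' D
S = float(one.T @ D @ one) + sigma_agg ** 2
cov_post = D - (D @ one @ one.T @ D) / S

sd = np.sqrt(np.diag(cov_post))
corr = cov_post / np.outer(sd, sd)
print(np.round(corr, 3))                 # off-diagonal correlations are negative here
```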

3.
Estimation in mixed linear models is, in general, computationally demanding, since applied problems may involve extensive data sets and large numbers of random effects. Existing computer algorithms are slow and/or require large amounts of memory. These problems are compounded in generalized linear mixed models for categorical data, since even approximate methods involve fitting of a linear mixed model within steps of an iteratively reweighted least squares algorithm. Only in models in which the random effects are hierarchically nested can the computations for fitting these models to large data sets be carried out rapidly. We describe a data augmentation approach to these computational difficulties in which we repeatedly fit an overlapping series of submodels, incorporating the missing terms in each submodel as 'offsets'. The submodels are chosen so that they have a nested random-effect structure, thus allowing maximum exploitation of the computational efficiency which is available in this case. Examples of the use of the algorithm for both metric and discrete responses are discussed, all calculations being carried out using macros within the MLwiN program.

4.
If unit-level data are available, small area estimation (SAE) is usually based on models formulated at the unit level, but they are ultimately used to produce estimates at the area level and thus involve area-level inferences. This paper investigates the circumstances under which using an area-level model may be more effective. Linear mixed models (LMMs) fitted using different levels of data are applied in SAE to calculate synthetic estimators and empirical best linear unbiased predictors (EBLUPs). The performance of area-level models is compared with unit-level models when both individual and aggregate data are available. A key factor is whether there are substantial contextual effects. Ignoring these effects in unit-level working models can cause biased estimates of regression parameters. The contextual effects can be automatically accounted for in the area-level models. Using synthetic and EBLUP techniques, small area estimates based on different levels of LMMs are investigated in this paper by means of a simulation study.
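A hedged simulation sketch of the contextual-effect point made above: when the area mean of a covariate has its own effect, a unit-level working model that omits it gives a distorted slope, while a model fitted to area-level aggregates absorbs the combined effect. Ordinary least squares stands in for the paper's linear mixed models and EBLUPs, and all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n_areas, n_per = 100, 20
area = np.repeat(np.arange(n_areas), n_per)

# Covariate with genuine between-area variation plus within-area noise
area_mean_x = rng.normal(0.0, 1.0, n_areas)
x = area_mean_x[area] + rng.normal(0.0, 1.0, n_areas * n_per)
xbar = np.array([x[area == a].mean() for a in range(n_areas)])

beta, gamma = 1.0, 2.0                     # unit-level slope and contextual effect (assumed)
u = rng.normal(0.0, 0.5, n_areas)          # area random effects
y = beta * x + gamma * xbar[area] + u[area] + rng.normal(0.0, 1.0, x.size)
ybar = np.array([y[area == a].mean() for a in range(n_areas)])

slope_unit = np.polyfit(x, y, 1)[0]        # unit-level working model ignoring xbar
slope_area = np.polyfit(xbar, ybar, 1)[0]  # area-level model on aggregates

print(f"true unit-level slope beta           : {beta}")
print(f"unit-level fit ignoring context      : {slope_unit:.2f}")  # pulled away from beta
print(f"area-level fit (absorbs beta + gamma): {slope_area:.2f}")  # roughly beta + gamma
```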

5.
The road system in region RA of Leicester has vehicle detectors embedded in many of the network's road links. Vehicle counts from these detectors can provide transportation researchers with a rich source of data. However, for many projects it is necessary for researchers to have an estimate of origin-to-destination vehicle flow rates. Obtaining such estimates from data observed on individual road links is a non-trivial statistical problem, made more difficult in the present context by non-negligible measurement errors in the vehicle counts collected. The paper uses road link traffic count data from April 1994 to estimate the origin–destination flow rates for region RA. A model for the error prone traffic counts is developed, but the resulting likelihood is not available in closed form. Nevertheless, it can be smoothly approximated by using Monte Carlo integration. The approximate likelihood is combined with prior information from a May 1991 survey in a Bayesian framework. The posterior is explored using the Hastings–Metropolis algorithm, since its normalizing constant is not available. Preliminary findings suggest that the data are overdispersed according to the original model. Results for a revised model indicate that a degree of overdispersion exists, but that the estimates of origin–destination flow rates are quite insensitive to the change in model specification.
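The abstract combines a Monte Carlo approximation of an intractable likelihood with Hastings–Metropolis sampling. The sketch below reproduces only that general pattern on an invented toy network (two origin–destination pairs, three links, made-up routing matrix, error scale, and rates); it is not the Leicester model or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy network: 2 origin-destination pairs routed over 3 links.
A = np.array([[1, 0],
              [1, 1],
              [0, 1]])
true_rates = np.array([30.0, 50.0])
counts = rng.poisson(A @ true_rates)                 # "observed" link counts

def log_lik_mc(rates, n_mc=1000):
    """Crude Monte Carlo approximation of the log-likelihood of the link counts:
    simulate Poisson OD flows, add Gaussian measurement error on the links, and
    score the observed counts with a smooth (Gaussian-kernel) density estimate."""
    flows = rng.poisson(rates, size=(n_mc, 2))
    links = flows @ A.T + rng.normal(scale=2.0, size=(n_mc, 3))
    h = 3.0                                          # kernel bandwidth (arbitrary)
    dens = np.exp(-0.5 * ((links - counts) / h) ** 2).prod(axis=1)
    return np.log(dens.mean() + 1e-300)

# Random-walk Hastings-Metropolis on the OD rates (flat positive prior for brevity;
# caching the current noisy log-likelihood makes this a rough pseudo-marginal scheme).
rates = np.array([20.0, 20.0])
cur_ll = log_lik_mc(rates)
samples = []
for _ in range(2000):
    prop = rates + rng.normal(scale=2.0, size=2)
    if np.all(prop > 0):
        prop_ll = log_lik_mc(prop)
        if np.log(rng.uniform()) < prop_ll - cur_ll:
            rates, cur_ll = prop, prop_ll
    samples.append(rates.copy())

print(np.mean(samples[500:], axis=0))                # rough posterior means
```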

6.
Statistical database management systems keep raw, elementary and/or aggregated data and include query languages with facilities to calculate various statistics from this data. In this article we examine statistical database query languages with respect to the criteria identified and taxonomy developed in Ozsoyoglu and Ozsoyoglu (1985b). The criteria include statistical metadata and objects, aggregation features and interface to statistical packages. The taxonomy of statistical database query languages classifies them with respect to the data model used, the type of user interface and method of implementation. Temporal databases are rich sources of data for statistical analysis. Aggregation features of temporal query languages, as well as the issues in calculating aggregates from temporal data, are also examined.

7.
Suppose estimates are available for correlations between pairs of variables but that the matrix of correlation estimates is not positive definite. In various applications, having a valid correlation matrix is important in connection with follow-up analyses that might, for example, involve sampling from a valid distribution. We present new methods for adjusting the initial estimates to form a proper, that is, nonnegative definite, correlation matrix. These are based on constructing certain pseudo-likelihood functions, formed by multiplying together exact or approximate likelihood contributions associated with the individual correlations. Such pseudo-likelihoods may then be maximized over the range of proper correlation matrices. They may also be utilized to form pseudo-posterior distributions for the unknown correlation matrix, by factoring in relevant prior information for the separate correlations. We illustrate our methods on two examples from a financial time series and genomic pathway analysis.
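A sketch of the pseudo-likelihood idea under stated assumptions: each pairwise correlation contributes an approximate Fisher-z likelihood term, and the product is maximized over proper correlation matrices via a row-normalized Cholesky parameterization. The initial estimates and per-pair sample sizes are invented, and the Fisher-z contributions are one plausible choice rather than the authors' exact construction.

```python
import numpy as np
from scipy.optimize import minimize

# Pairwise correlation estimates (indefinite as a matrix) and per-pair sample
# sizes; both are made up for illustration.
r_init = np.array([[1.0, 0.9, 0.9],
                   [0.9, 1.0, -0.9],
                   [0.9, -0.9, 1.0]])     # not positive definite
n_pair = np.array([[0, 30, 25],
                   [30, 0, 40],
                   [25, 40, 0]])

p = r_init.shape[0]
idx = np.tril_indices(p, k=-1)

def fisher_z(r):
    return np.arctanh(np.clip(r, -0.999, 0.999))

def build_corr(theta):
    """Proper correlation matrix from a row-normalized lower-triangular factor."""
    L = np.zeros((p, p))
    L[np.tril_indices(p)] = theta
    L = L / np.maximum(np.linalg.norm(L, axis=1, keepdims=True), 1e-12)
    return L @ L.T

def neg_pseudo_loglik(theta):
    R = build_corr(theta)
    z_obs, z_fit = fisher_z(r_init[idx]), fisher_z(R[idx])
    w = n_pair[idx] - 3                    # Fisher-z precision weights
    return 0.5 * np.sum(w * (z_obs - z_fit) ** 2)

theta0 = np.eye(p)[np.tril_indices(p)]     # start at the identity matrix
res = minimize(neg_pseudo_loglik, theta0, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})

R_hat = build_corr(res.x)
print(np.round(R_hat, 3))
print("smallest eigenvalue:", np.linalg.eigvalsh(R_hat).min())
```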

8.
Collecting individual patient data has been described as the 'gold standard' for undertaking meta-analysis. If studies involve time-to-event outcomes, conducting a meta-analysis based on aggregate data can be problematical. Two meta-analyses of randomized controlled trials with time-to-event outcomes are used to illustrate the practicality and value of several proposed methods to obtain summary statistic estimates. In the first example the results suggest that further effort should be made to find unpublished trials. In the second example the use of aggregate data for trials where no individual patient data have been supplied allows the totality of evidence to be assessed and indicates previously unrecognized heterogeneity.

9.
We derived two methods to estimate the logistic regression coefficients in a meta-analysis when only the 'aggregate' data (mean values) from each study are available. The estimators we proposed are the discriminant function estimator and the reverse Taylor series approximation. These two methods of estimation gave similar estimators using an example of individual data. However, when aggregate data were used, the discriminant function estimators were quite different from the other two estimators. A simulation study was then performed to evaluate the performance of these two estimators as well as the estimator obtained from the model that simply uses the aggregate data in a logistic regression model. The simulation study showed that all three estimators are biased. The bias increases as the variance of the covariate increases. The distribution type of the covariates also affects the bias. In general, the estimator from the logistic regression using the aggregate data has less bias and better coverage probabilities than the other two estimators. We concluded that analysts should be cautious in using aggregate data to estimate the parameters of the logistic regression model for the underlying individual data.
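For concreteness, the sketch below shows the classical discriminant function estimator of a logistic slope, which needs only group means and a pooled variance, next to an individual-data maximum likelihood fit. The simulated data and the IRLS implementation are illustrative only; the paper's comparison also involves a reverse Taylor series approximation and study-level aggregate data, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate individual-level data: a normal covariate whose mean differs by outcome
n0, n1 = 500, 500
x0 = rng.normal(0.0, 1.0, n0)      # controls
x1 = rng.normal(1.0, 1.0, n1)      # cases
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

# Discriminant function estimator of the logistic slope: only group means and a
# pooled variance are needed, which is why it is attractive with aggregate data.
s2 = (np.var(x0, ddof=1) * (n0 - 1) + np.var(x1, ddof=1) * (n1 - 1)) / (n0 + n1 - 2)
beta_df = (x1.mean() - x0.mean()) / s2

# Maximum likelihood logistic fit via IRLS, for comparison (needs individual data)
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(25):
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    W = prob * (1 - prob)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - prob))

print(f"discriminant function slope: {beta_df:.3f}")
print(f"ML logistic slope:           {beta[1]:.3f}")
```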

10.
Hu Fan (胡帆). Statistical Research (《统计研究》), 2010, 27(11): 53–56
Drawing on the concept of a total quality management system, this paper analyses the elements of statistical survey data quality management that run through the entire statistical workflow and the roles they play. It focuses on the total quality management process and the layout of key tasks and, in connection with the development of statistical informatization, gives particular attention to the role of the relevant work standards and application software, as well as to the construction and use of data resources.

11.
The Poisson–Lindley distribution is a compound discrete distribution that can be used as an alternative to other discrete distributions, like the negative binomial. This paper develops approximate one-sided and equal-tailed two-sided tolerance intervals for the Poisson–Lindley distribution. Practical applications of the Poisson–Lindley distribution frequently involve large samples, thus we utilize large-sample Wald confidence intervals in the construction of our tolerance intervals. A coverage study is presented to demonstrate the efficacy of the proposed tolerance intervals. The tolerance intervals are also demonstrated using two real data sets. The R code developed for our discussion is briefly highlighted and included in the tolerance package.
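The paper's tolerance intervals are built in R (the tolerance package) from large-sample Wald confidence intervals. As a hedged Python sketch of the same general construction, not the paper's exact formulas, the code below fits the Poisson–Lindley parameter by maximum likelihood, forms a numerical Wald interval, and converts its lower limit into an approximate one-sided upper tolerance limit; the simulated data and the (0.90, 0.95) levels are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
theta_true, n = 1.5, 500

# Simulate Poisson-Lindley counts: X | lam ~ Poisson(lam), lam ~ Lindley(theta),
# where Lindley(theta) is a mixture of Exp(theta) and Gamma(2, 1/theta).
comp = rng.uniform(size=n) < theta_true / (theta_true + 1)
lam = np.where(comp, rng.gamma(1.0, 1.0 / theta_true, n), rng.gamma(2.0, 1.0 / theta_true, n))
x = rng.poisson(lam)

def pl_pmf(k, theta):
    """Poisson-Lindley pmf: theta^2 (k + theta + 2) / (theta + 1)^(k + 3)."""
    return theta ** 2 * (k + theta + 2) / (theta + 1) ** (k + 3)

def negloglik(theta):
    return -np.sum(np.log(pl_pmf(x, theta)))

# Maximum likelihood estimate and a numerical Wald standard error
fit = minimize_scalar(negloglik, bounds=(1e-3, 50), method="bounded")
theta_hat = fit.x
h = 1e-3
se = 1.0 / np.sqrt((negloglik(theta_hat + h) - 2 * negloglik(theta_hat)
                    + negloglik(theta_hat - h)) / h ** 2)

# Approximate one-sided upper tolerance limit with content 0.90, confidence 0.95:
# use the Wald lower confidence limit for theta (heavier tail) and invert its cdf.
theta_low = max(theta_hat - norm.ppf(0.95) * se, 1e-3)
ks = np.arange(0, 200)
cdf = np.cumsum(pl_pmf(ks, theta_low))
upper_tol = ks[np.searchsorted(cdf, 0.90)]
print(f"theta_hat = {theta_hat:.3f}, upper (0.90, 0.95) tolerance limit = {upper_tol}")
```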

12.
Short-run household electricity demand has been estimated with conditional demand models by a variety of authors using both aggregate data and disaggregate data. Disaggregate data are most desirable for estimating these models. However, in many cases, available disaggregate data may be inappropriate. Furthermore, disaggregate data may be unavailable altogether. In these cases, readily available aggregate data may be more appropriate. This article develops and evaluates an econometric technique to generate unbiased estimates of household electricity demand using such aggregate data.

13.
This paper is concerned with information retrieval. The basic problem is how to store large masses of data in such a way that whenever information regarding some particular aspect of the data is needed, such information is easily and efficiently retrieved. Work in this field is thus very important for organizations dealing with large classes of data. The consecutive retrieval (C-R) property defined by S.P. Ghosh is an important relation between a set of queries and a set of records. Its existence enables the design of an information retrieval system with a minimal search time and no redundant storage, in that the records can be organized in such a way that those pertinent to any query are stored in consecutive storage locations. The C-R property, however, cannot exist between every arbitrary query set and every record set. A subset of the query set Q having the C-R property is called a C-R subset, and a C-R subset having the maximum cardinality is called the maximal C-R subset. A partition of Q is called a C-R partition if every subset has the C-R property. A C-R partition with the minimum number of subsets is called the minimal C-R partition. With respect to the set of all binary queries and the set of all binary records, it is shown that the maximal cardinality of a C-R subset is 2l-1, where l is the number of attributes concerned. A combinatorial characterization of a maximal C-R subset is also given. A lower bound on the number of subsets in a C-R partition and several examples which attain the lower bound are given. A general procedure for obtaining a minimal C-R partition which attains the lower bound is given, provided the number of attributes is even.
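As a small illustration of the C-R property itself (not of the combinatorial results on binary queries and partitions), the brute-force check below searches for a record ordering in which every query's pertinent records occupy consecutive positions; the records and queries are made up.

```python
from itertools import permutations

# Records a-e and, for each query, the set of records pertinent to it (made up).
records = ["a", "b", "c", "d", "e"]
queries = [{"a", "b"}, {"b", "c", "d"}, {"d", "e"}]

def cr_ordering(records, queries):
    """Return an ordering in which every query's records are consecutive, if one exists."""
    for order in permutations(records):
        pos = {r: i for i, r in enumerate(order)}
        ok = all(max(pos[r] for r in q) - min(pos[r] for r in q) + 1 == len(q)
                 for q in queries)
        if ok:
            return order
    return None                       # the query set has no C-R ordering

print(cr_ordering(records, queries))                     # e.g. ('a', 'b', 'c', 'd', 'e')
print(cr_ordering(records, queries + [{"a", "e"}]))      # None: C-R property fails
```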

14.
Methodology for Bayesian inference is considered for a stochastic epidemic model which permits mixing on both local and global scales. Interest focuses on estimation of the within- and between-group transmission rates given data on the final outcome. The model is sufficiently complex that the likelihood of the data is numerically intractable. To overcome this difficulty, an appropriate latent variable is introduced, about which asymptotic information is known as the population size tends to infinity. This yields a method for approximate inference for the true model. The methods are applied to real data, tested with simulated data, and also applied to a simple epidemic model for which exact results are available for comparison.

15.
The author proposes to use weighted likelihood to approximate Bayesian inference when no external or prior information is available. He proposes a weighted likelihood estimator that minimizes the empirical Bayes risk under relative entropy loss. He discusses connections among the weighted likelihood, empirical Bayes and James-Stein estimators. Both simulated and real data sets are used for illustration purposes.
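The abstract draws connections among weighted likelihood, empirical Bayes, and James-Stein estimation. The sketch below only reproduces the familiar James-Stein benchmark on simulated normal means, as a reference point for the kind of shrinkage being discussed; it is not the weighted-likelihood estimator proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_rep = 10, 5000
risk_mle = risk_js = 0.0
for _ in range(n_rep):
    theta = rng.normal(0.0, 1.0, p)            # unknown normal means (simulated)
    x = theta + rng.normal(0.0, 1.0, p)        # one unit-variance observation per mean
    js = (1.0 - (p - 2) / np.sum(x ** 2)) * x  # James-Stein shrinkage toward zero
    risk_mle += np.mean((x - theta) ** 2)
    risk_js += np.mean((js - theta) ** 2)

print("average MLE risk        :", round(risk_mle / n_rep, 3))
print("average James-Stein risk:", round(risk_js / n_rep, 3))
```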

16.
17.
In the present paper we are going to extend the likelihood ratio test to the case in which the available experimental information involves fuzzy imprecision (more precisely, the observable events associated with the random experiment concerning the test may be characterized as fuzzy subsets of the sample space, as intended by Zadeh, 1965). In addition, we will approximate the immediate intractable extension, which is based on Zadeh's probabilistic definition, by using the minimum inaccuracy principle of estimation from fuzzy data, that has been introduced in previous papers as an operative extension of the maximum likelihood method.

18.
The case for small area microdata
Census data are available in aggregate form for local areas and, through the samples of anonymized records (SARs), as samples of microdata for households and individuals. In 1991 there were two SAR files: a household file and an individual file. These have a high degree of detail on the census variables but little geographical detail, a situation that will be exacerbated for the 2001 SAR owing to the loss of district level geography on the individual SAR. The paper puts forward the case for an additional sample of microdata, also drawn from the census, that has much greater geographical detail. Small area microdata (SAM) are individual level records with local area identifiers and, to maintain confidentiality, reduced detail on the census variables. Population data from seven local authorities, including rural and urban areas, are used to define prototype samples of SAM. The rationale for SAM is given, with examples that demonstrate the role of local area information in the analysis of census data. Since there is a trade-off between the extent of local detail and the extent of detail on variables that can be made available, the confidentiality risk of SAM is assessed empirically. An indicative specification of the SAM is given, having taken into account the results of the confidentiality analysis.

19.
The area under the receiver operating characteristic (ROC) curve (AUC) is broadly accepted and often used as a diagnostic accuracy index. Moreover, the equality among the predictive capacity of two or more diagnostic systems is frequently checked from the comparison of their respective AUCs. In paired designs, this comparison is usually performed by using only the subjects who have collected all the necessary information, in the so-called available-case analysis. On the other hand, the presence of missing data is a frequent problem, especially in retrospective and observational studies. The loss of statistical power and the misuse of the available information (with the resulting ethical implications) are the main consequences. In this paper a non-parametric method is developed to exploit all available information. In order to approximate the distribution for the proposed statistic, the asymptotic distribution is computed and two different resampling plans are studied. In addition, the methodology is applied to a real-world medical problem. Finally, some technical issues are also reported in the Appendix.
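To make the available-case baseline concrete, the sketch below computes nonparametric (Mann-Whitney) AUCs for two markers with simulated missing values and compares them using only subjects with both markers observed. The paper's contribution is precisely to exploit all available information instead of only these complete cases; that method is not reproduced here, and all data are invented.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two diagnostic markers measured (with some values missing) on diseased and
# healthy subjects; values and missingness pattern are simulated for illustration.
n = 200
disease = rng.uniform(size=n) < 0.4
m1 = rng.normal(disease * 1.0, 1.0)
m2 = rng.normal(disease * 0.8, 1.0)
m1[rng.uniform(size=n) < 0.2] = np.nan      # 20% missing at random
m2[rng.uniform(size=n) < 0.2] = np.nan

def auc(marker, disease):
    """Nonparametric (Mann-Whitney) AUC using whichever subjects have this marker."""
    ok = ~np.isnan(marker)
    pos, neg = marker[ok & disease], marker[ok & ~disease]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Available-case comparison: only subjects with BOTH markers observed contribute.
both = ~np.isnan(m1) & ~np.isnan(m2)
print("AUC marker 1 (all its cases):", round(auc(m1, disease), 3))
print("AUC marker 2 (all its cases):", round(auc(m2, disease), 3))
print("paired available-case AUC difference:",
      round(auc(m1[both], disease[both]) - auc(m2[both], disease[both]), 3))
```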

20.
This article describes a method for computing approximate statistics for large data sets, when exact computations may not be feasible. Such situations arise in applications such as climatology, data mining, and information retrieval (search engines). The key to our approach is a modular approximation to the cumulative distribution function (cdf) of the data. Approximate percentiles (as well as many other statistics) can be computed from this approximate cdf. This enables the reduction of a potentially overwhelming computational exercise into smaller, manageable modules. We illustrate the properties of this algorithm using a simulated data set. We also examine the approximation characteristics of the approximate percentiles, using a von Mises functional type approach. In particular, it is shown that the maximum error between the approximate cdf and the actual cdf of the data is never more than 1% (or any other preset level). We also show that under assumptions of underlying smoothness of the cdf, the approximation error is much lower in an expected sense. Finally, we derive bounds for the approximation error of the percentiles themselves. Simulation experiments show that these bounds can be quite tight in certain circumstances.
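A minimal sketch of the chunked-cdf idea: each module of data contributes counts on a fixed grid, the counts are merged into an approximate cdf, and approximate percentiles are read off that cdf. It does not reproduce the paper's algorithm or its error bounds; the grid range, chunk sizes, and simulated distribution are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
data_stream = (rng.lognormal(0.0, 1.0, 100_000) for _ in range(20))  # 20 "modules"

# Modular approximation to the cdf: each chunk contributes counts on a fixed grid,
# so the chunks never need to be held in memory together.
grid = np.linspace(0.0, 50.0, 5001)          # fixed evaluation grid (assumed range)
counts = np.zeros(grid.size)
n_total = 0
all_chunks = []                               # kept only to check against exact answers
for chunk in data_stream:
    counts += np.searchsorted(np.sort(chunk), grid, side="right")
    n_total += chunk.size
    all_chunks.append(chunk)

cdf_approx = counts / n_total                 # approximate cdf on the grid

def approx_percentile(q):
    """Smallest grid point whose approximate cdf reaches q/100."""
    return grid[np.searchsorted(cdf_approx, q / 100.0)]

exact = np.percentile(np.concatenate(all_chunks), [50, 90, 99])
approx = [approx_percentile(q) for q in (50, 90, 99)]
print("exact  :", np.round(exact, 3))
print("approx :", np.round(approx, 3))
```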
