Similar Literature
20 similar documents found.
1.
Statistical database management systems keep raw, elementary, and/or aggregated data and include query languages with facilities to calculate various statistics from these data. In this article we examine statistical database query languages with respect to the criteria identified and the taxonomy developed in Ozsoyoglu and Ozsoyoglu (1985b). The criteria include statistical metadata and objects, aggregation features, and interfaces to statistical packages. The taxonomy of statistical database query languages classifies them with respect to the data model used, the type of user interface, and the method of implementation. Temporal databases are rich sources of data for statistical analysis. Aggregation features of temporal query languages, as well as the issues in calculating aggregates from temporal data, are also examined.

2.
Random sampling from databases: a survey
This paper reviews recent literature on techniques for obtaining random samples from databases. We begin with a discussion of why one would want to include sampling facilities in database management systems. We then review basic sampling techniques used in constructing DBMS sampling algorithms, e.g. acceptance/rejection and reservoir sampling. A discussion of sampling from various data structures follows: B+-trees, hash files, and spatial data structures (including R-trees and quadtrees). Algorithms for sampling from simple relational queries, e.g. single relational operators such as selection, intersection, union, set difference, projection, and join, are then described. We then describe sampling for estimation of aggregates (e.g. the size of query results), discussing both clustered sampling and sequential sampling approaches. Decision-theoretic approaches to sampling for query optimization are also reviewed.
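The abstract names reservoir sampling as one of the basic building blocks for DBMS sampling. As a concrete illustration, here is a minimal sketch of classic Algorithm R in Python; the function name and the use of a plain iterable in place of a DBMS cursor are illustrative assumptions, not part of the surveyed systems.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly at random from an iterable of
    unknown length in a single pass (classic Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing entry with probability k / (i + 1),
            # which keeps every item equally likely to be retained.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 "rows" from a simulated scan of 10,000 rows.
print(reservoir_sample(range(10_000), 5))
```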

3.
Modelling the underlying stochastic process is one of the main goals in the study of many dynamic phenomena, such as signal processing, system identification and time series. The issue is often addressed within the ARMA (Autoregressive Moving Average) framework, so that the related task of identifying the ‘true’ order is crucial. As is well known, the effectiveness of such an approach may be seriously compromised by misspecification errors, since they may affect the model's ability to capture the dynamic structure of the process. As a result, inference and empirical outcomes may be heavily misleading. Despite the large number of available approaches for determining the order of an ARMA model, the issue is still open. In this paper, we bring the problem into the framework of bootstrap theory in conjunction with Akaike's information criterion (AIC), and a new method for ARMA model selection is presented. A theoretical justification for the proposed approach, as well as an evaluation of its small-sample performance via a simulation study, is given.
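For orientation, the sketch below shows only the plain AIC grid search over candidate ARMA orders with statsmodels, applied to a simulated series; the simulated data, the order grid, and the seed are assumptions, and the bootstrap refinement that is the paper's actual contribution is not reproduced here.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

# Simulate an ARMA(2, 1) series as a stand-in for real data (assumption).
np.random.seed(0)
true_ar = np.array([1, -0.6, 0.3])   # lag polynomial 1 - 0.6L + 0.3L^2
true_ma = np.array([1, 0.4])
y = ArmaProcess(true_ar, true_ma).generate_sample(nsample=300)

# Grid search over (p, q), keeping the order with the smallest AIC.
best = None
for p in range(4):
    for q in range(4):
        try:
            res = ARIMA(y, order=(p, 0, q)).fit()
        except Exception:
            continue  # some candidate orders may fail to converge
        if best is None or res.aic < best[0]:
            best = (res.aic, p, q)

aic, p, q = best
print(f"selected ARMA({p},{q}) with AIC = {aic:.1f}")
# The paper's proposal would, roughly, repeat this selection on bootstrap
# replicates of the series and keep the most frequently chosen order.
```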

4.
Data analysts often explore a large database to identify data of interest, but may not be able to specify the exact query to send to the database. A manual data exploration process is labor intensive and time-consuming. In the new paradigm of system-aided interactive data exploration, the database management system presents samples to the user and engages the user in an interactive exploration process to identify the user's interest. In this article, we examine a number of initial sampling techniques for identifying at least one positive (i.e., interesting) sample and compare them both theoretically and empirically.

5.
Concerning the task of integrating census and survey data from different sources, as carried out by supranational statistical agencies, a formal metadata approach is investigated which supports data integration and table processing simultaneously. To this end, a metadata model is devised such that statistical query processing is accomplished by means of symbolic reasoning on machine-readable, operative metadata. As in databases, statistical queries are stated as formal expressions specifying declaratively what the intended output is; the operations necessary to retrieve appropriate available source data and to aggregate source data into the requested macrodata are derived mechanically. Using simple mathematics, this paper focuses particularly on the metadata model devised to harmonize semantically related data sources, as well as the table model providing the principal data structure of the proposed system. Only an outline of the general design of a statistical information system based on the proposed metadata model is given, and the state of development is summarized briefly.

6.
The Randomized Response (RR) technique is a well-established interview procedure which guarantees privacy protection in social surveys dealing with sensitive items. The RR method assumes a stochastic mechanism that creates uncertainty about the true status of the respondents in order to ensure privacy protection and to avoid tendencies to dissimulate or to respond in a socially desirable direction. A very general model for the RR method was introduced by Franklin (Commun Stat Theory Methods 18:489–505, 1989) for the case where a single sensitive question is under study. However, since social surveys are often based on questionnaires containing more than one sensitive question, the analysis of multivariate RR data is of considerable interest. This paper focuses on the generalization of the Franklin model to a setting with multiple sensitive questions and on related inferential issues.
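Franklin's model is more general, but the basic flavour of RR estimation can be seen in the classical Warner design, simulated in the sketch below; the design probability, prevalence, and sample size are illustrative assumptions, and this is neither Franklin's model nor the multivariate extension the paper develops.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, pi_true = 2000, 0.7, 0.30   # sample size, design probability, true prevalence (assumptions)

# Each respondent carries the sensitive trait with probability pi_true.
x = rng.random(n) < pi_true
# With probability p the respondent answers the sensitive question truthfully,
# otherwise answers its negation ("do you NOT carry the trait?").
truthful = rng.random(n) < p
answer = np.where(truthful, x, ~x)

lam_hat = answer.mean()                      # observed proportion of "yes" answers
pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)   # Warner's unbiased estimator of the prevalence
print(f"observed 'yes' rate: {lam_hat:.3f}, corrected estimate: {pi_hat:.3f}")
```

The correction follows from P(yes) = p·pi + (1 − p)(1 − pi), so the interviewer never learns any individual's true status yet the population prevalence remains estimable.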

7.
This article analyzes the effects of multicollinearity on the maximum likelihood (ML) estimator for the Tobit regression model. Furthermore, a ridge regression (RR) estimator is proposed, since the mean squared error (MSE) of ML becomes inflated when the regressors are collinear. To investigate the performance of the traditional ML and the RR approaches we use Monte Carlo simulations where the MSE is used as the performance criterion. The simulation results indicate that the RR approach should always be preferred to the ML estimation method.
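For orientation, one ridge-type form that is common in this strand of the literature shrinks the ML estimator directly; the display below is a hedged sketch of that form, with ridge constant \(k\), and the paper's exact definition of its RR estimator for the Tobit model may differ.

\[
  \hat{\beta}_{RR}(k) \;=\; \bigl(X^{\top}X + kI_p\bigr)^{-1} X^{\top}X\, \hat{\beta}_{ML},
  \qquad k \ge 0 .
\]

With \(k = 0\) this reduces to the ML estimator; increasing \(k\) introduces bias but can reduce variance enough to lower the MSE when \(X^{\top}X\) is near-singular, which is the trade-off the Monte Carlo study evaluates.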

8.
As an effective tool for data storage, processing, and computing, ontology has been used in many fields of computer science and information technology. Owing to its strong performance in semantic querying and knowledge extraction, domain ontologies have been built for various disciplines such as biology, pharmaceutics, geography, and chemistry, and have been smoothly employed in their engineering applications. In these ontology applications, the aim is to obtain an optimal ontology function which maps each ontology concept to a real number and then to determine the similarity between concepts by the distance between their corresponding real numbers. In former ontology learning approaches, all instances in the training sample have equal status in the learning process. In this article, we present a disequilibrium multi-dividing ontology algorithm in which important ontology data are highlighted during learning, while less relevant ontology data tend to be eliminated. Four experiments are designed to test the effectiveness of our disequilibrium multi-dividing algorithm from the angles of ontology similarity measurement and ontology mapping construction.

9.
Spatio-temporal processes are often high-dimensional, exhibiting complicated variability across space and time. Traditional state-space model approaches to such processes in the presence of uncertain data have been shown to be useful. However, estimation of state-space models in this context is often problematic, since parameter vectors and matrices are of high dimension and can have complicated dependence structures. We propose a spatio-temporal dynamic model formulation with parameter matrices restricted based on prior scientific knowledge and/or common spatial models. Estimation is carried out via the expectation–maximization (EM) algorithm or a generalized EM algorithm. Several parameterization strategies are proposed, and analytical or computational closed-form EM update equations are derived for each. We apply the methodology to a model based on an advection–diffusion partial differential equation in a simulation study, and also to a dimension-reduced model for a Palmer Drought Severity Index (PDSI) data set.

10.
We consider the statistical evaluation and estimation of vaccine efficacy when the protective effect wanes with time. We reanalyse data from a 5-year trial of two oral cholera vaccines in Matlab, Bangladesh. In this field trial, one vaccine appears to confer better initial protection than the other, but neither appears to offer protection for a period longer than about 3 years. Time-dependent vaccine effects are estimated by obtaining smooth estimates of a time-varying relative risk RR(t) using survival analysis. We compare two approaches based on the Cox model in terms of their strategies for detecting time-varying vaccine effects, and their estimation techniques for obtaining a time-dependent RR(t) estimate. These methods allow an exploration of time-varying vaccine effects while making minimal parametric assumptions about the functional form of RR(t) for vaccinated compared with unvaccinated subjects.

11.
Summary  The detection of errors and outliers is an important step in data processing, especially for errors arising from data entry operations, because these are entirely the responsibility of the data processing staff. The duplicate performance method is commonly used to detect this type of error: it typically involves typing the same data twice, without any special ordering. If the errors are uniformly distributed among records, retyping a fraction of the total will typically also remove the same fraction of the errors. A new method is presented which improves on that procedure by sorting the records so that the most unlikely ones come first. The ability of the present methodology has been tested by a Monte Carlo simulation, using an existing database of categorical answers on housing characteristics in Uruguay. The database was first randomly contaminated, and the proposed procedure was then applied. The results show that if partial retyping is done following the proposed order, about 50% of the errors can be removed while keeping the retyping effort between 4 and 14% of the dataset, whereas to attain a similar result with the standard methodology 50% (on average) of the database would have to be processed. The new ordering is based upon the unrotated Principal Component Analysis (PCA) transformation of the previously coded data. No special shape of the multivariate distribution function is assumed or required.
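One plausible reading of the proposed ordering is sketched below with synthetic data: code the categorical answers numerically, project them onto unrotated principal components, and put first the records that look least likely (largest standardized distance in component space). The scoring rule, the 10% retyping budget, and the synthetic data are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Illustrative stand-in for coded categorical survey answers (n records, m items).
X = rng.integers(0, 4, size=(500, 8)).astype(float)

# Unrotated PCA on the coded data (fit_transform centers the columns).
pca = PCA()
scores = pca.fit_transform(X)

# "Unlikeliness" score: squared component scores standardized by the variance
# each component explains, so records far from the bulk of the data rank first.
unlikeliness = ((scores ** 2) / pca.explained_variance_).sum(axis=1)
retype_order = np.argsort(-unlikeliness)

# Retype only the first, say, 10% of records in this order.
budget = int(0.10 * len(X))
print("record indices to retype first:", retype_order[:budget][:10], "...")
```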

12.
Inverse probability weighting (IPW) and multiple imputation are two widely adopted approaches for dealing with missing data. The former models the selection probability, and the latter models the data distribution. Consistent estimation requires correct specification of the corresponding models. Although the augmented IPW method provides an extra layer of protection on consistency, it is usually not sufficient in practice, as the true data-generating process is unknown. This paper proposes a method combining the two approaches in the same spirit as calibration in the survey sampling literature. Multiple models for both the selection probability and the data distribution can be accounted for simultaneously, and the resulting estimator is consistent if any one model is correctly specified. The proposed method is within the framework of estimating equations and is general enough to cover regression analysis with missing outcomes and/or missing covariates. Results from both theoretical and numerical investigations are provided.
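As background for the combination the paper proposes, the sketch below shows the basic IPW ingredient on its own: estimate the selection (response) probability with a logistic model and reweight the observed outcomes. The data-generating step and variable names are illustrative assumptions, and the paper's calibration-style combination with multiple working models is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000

# Covariate, outcome, and a missingness mechanism that depends on x (assumptions).
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
p_obs = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))   # true response probability
r = rng.random(n) < p_obs                    # r = True where y is observed

# Model the selection probability from the fully observed covariate.
sel = LogisticRegression().fit(x.reshape(-1, 1), r)
pi_hat = sel.predict_proba(x.reshape(-1, 1))[:, 1]

naive = y[r].mean()                                 # complete-case mean (biased here)
ipw = np.sum(r * y / pi_hat) / np.sum(r / pi_hat)   # IPW (Hajek-type) estimate
print(f"full-data mean: {y.mean():.3f}, complete-case: {naive:.3f}, IPW: {ipw:.3f}")
```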

13.
Summary  The paper first provides a short review of the most common microeconometric models, including logit, probit, discrete choice, duration models, models for count data, and Tobit-type models. In the second part we consider the situation in which the micro data have undergone some anonymization procedure, which has become an important issue since otherwise confidentiality would not be guaranteed. We briefly describe the most important approaches to data protection, which can also be seen as deliberately introducing measurement error. We also consider the possibility of correcting the estimation procedure while taking the anonymization procedure into account. We illustrate this for the case of binary data which are anonymized by ‘post-randomization’ and then used in a probit model. We show the effect of ‘naive’ estimation, i.e. estimation that disregards the anonymization procedure. We also show that a ‘corrected’ estimate is available which is satisfactory in statistical terms; this remains true when parameters of the anonymization procedure must themselves be estimated. Research in this paper is related to the project “Faktische Anonymisierung wirtschaftsstatistischer Einzeldaten”, financed by the German Ministry of Research and Technology.
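The probit correction itself is model-specific, but the effect of ‘naive’ estimation under post-randomization can be illustrated in the simplest setting, a single binary proportion. The transition probabilities below are illustrative assumptions, and the moment correction shown is only the univariate analogue of the paper's corrected probit estimator, not that estimator itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n, pi_true = 10_000, 0.25   # sample size and true proportion (assumptions)
p11, p01 = 0.85, 0.10       # P(release 1 | true 1), P(release 1 | true 0)

x = rng.random(n) < pi_true                      # true binary attribute
z = rng.random(n) < np.where(x, p11, p01)        # post-randomized (released) value

lam_hat = z.mean()                               # 'naive' estimate from released data
pi_hat = (lam_hat - p01) / (p11 - p01)           # corrected moment estimate
print(f"true: {pi_true:.3f}, naive: {lam_hat:.3f}, corrected: {pi_hat:.3f}")
```

The correction inverts P(Z = 1) = p11·pi + p01·(1 − pi), which is the same idea the paper applies inside the probit likelihood when the randomization matrix is known or estimated.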

14.
Identifying cost-effective decisions that take into account both medical cost and health outcome is an important issue when resources are very limited. Analyzing medical costs is challenging owing to the skewness of cost distributions, heterogeneity across samples, and censoring. When censoring is due to administrative reasons, the total cost may be related to the survival time, since longer survival times are more likely to be censored and the corresponding total costs will be censored as well. This paper uses a general linear model for longitudinal data to model repeated medical cost data, and a weighted estimating equation is used to obtain more accurate parameter estimates. Furthermore, the asymptotic properties of the proposed model are discussed. Simulations are used to evaluate the performance of the estimators under various scenarios. Finally, the proposed model is applied to data extracted from the National Health Insurance database for patients with colorectal cancer.

15.
ABSTRACT

The Open Aid Malawi initiative has collected an unprecedented database that identifies as much location-specific information as possible for each of over 2500 individual foreign aid donations to Malawi since 2003. The efficient use and distribution of such aid is important to donors and to Malawi citizens. However, because of individual donor goals and the difficulty of tracking donor coordination, it is difficult to determine whether aid allocation is efficient. We compare several Bayesian spatial generalized linear mixed models to relate aid allocation to various economic indicators within seven donation sectors. We find that the spatial gamma regression model best predicts current aid allocation. While we are cautious about making strong claims based on this exploratory study, we provide a methodology by which one could (i) evaluate the efficiency of aid allocation by comparing the locations of current aid allocation with the need at those locations, and (ii) devise a strategy for efficient allocation of resources where an ideal relationship between aid allocation and economic sectors exists.

16.
ABSTRACT

In incident cohort studies, survival data often include subjects who have experienced an initiating event at recruitment and may potentially experience two successive events (a first and a second) during the follow-up period. When disease registries or surveillance systems collect data based on incidence occurring within a specific calendar time interval, the initial event is usually subject to double truncation. Furthermore, since the second duration process is observable only if the first event has occurred, double truncation and dependent censoring arise. In this article, under these two sampling biases and with an unspecified distribution of the truncation variables, we propose a nonparametric estimator of the joint survival function of the two successive duration times using the inverse-probability-weighted (IPW) approach. The consistency of the proposed estimator is established. Based on the estimated marginal survival functions, we also propose a two-stage procedure for estimating the parameters of a copula model. The bootstrap method is used to construct confidence intervals. Numerical studies demonstrate that the proposed estimation approaches perform well with moderate sample sizes.

17.
Multiple imputation is a common approach for dealing with missing values in statistical databases. The imputer fills in missing values with draws from predictive models estimated from the observed data, resulting in multiple completed versions of the database. Researchers have developed a variety of default routines to implement multiple imputation; however, there has been limited research comparing the performance of these methods, particularly for categorical data. We use simulation studies to compare the repeated-sampling properties of three default multiple imputation methods for categorical data: chained equations using generalized linear models, chained equations using classification and regression trees, and a fully Bayesian joint distribution based on Dirichlet process mixture models. We base the simulations on categorical data from the American Community Survey. In the circumstances of this study, the results suggest that default chained-equations approaches based on generalized linear models are dominated by the default regression tree and Bayesian mixture model approaches. They also suggest competing advantages for the regression tree and Bayesian mixture model approaches, making both reasonable default engines for multiple imputation of categorical data. Supplementary material for this article is available online.
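As a stripped-down illustration of the chained-equations idea for categorical data, the sketch below imputes two binary variables from each other with logistic regressions, drawing imputed values from the fitted probabilities and repeating the whole pass to obtain several completed datasets. The toy data, missingness rate, cycle count, and two-variable setting are assumptions, and none of the three default routines compared in the article (GLM chains, CART chains, Dirichlet process mixtures) is reproduced in full.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 1000

# Toy data: two correlated binary variables, with values missing completely at random.
a = rng.random(n) < 0.5
b = rng.random(n) < np.where(a, 0.8, 0.3)
data = np.column_stack([a, b]).astype(float)
mask = rng.random(data.shape) < 0.2          # True where a value is missing
data[mask] = np.nan

def impute_once(data, mask, rng, cycles=5):
    """One chained-equations pass: fill each column from the other,
    drawing imputations from the fitted logistic probabilities."""
    filled = np.where(mask, rng.random(data.shape) < 0.5, data).astype(float)
    for _ in range(cycles):
        for j in (0, 1):
            k = 1 - j
            obs = ~mask[:, j]
            model = LogisticRegression().fit(filled[obs][:, [k]], filled[obs, j])
            p = model.predict_proba(filled[~obs][:, [k]])[:, 1]
            filled[~obs, j] = rng.random(p.shape) < p
    return filled

completed = [impute_once(data, mask, rng) for _ in range(5)]   # 5 completed datasets
print("imputed means of column b:", [round(c[:, 1].mean(), 3) for c in completed])
```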

18.
Content recommendation on a webpage involves recommending content links (items) in multiple slots for each user visit so as to maximize some objective function, typically the click-through rate (CTR), which is the probability of clicking on an item for a given user visit. Most existing approaches to this problem assume that a user's responses (click/no click) on different slots are independent of each other. This is problematic, since in many scenarios the CTR on a slot may depend on externalities such as the items recommended on other slots. Incorporating the effects of such externalities in the modeling process is important for better predictive accuracy. We therefore propose a hierarchical model that assumes a multinomial response for each visit to incorporate competition among slots, and that models complex interactions among (user, item, slot) combinations through factor models via a tensor approach. In addition, factors in our model are drawn with means based on regression functions of user/item covariates, which helps us obtain better estimates for users/items that are relatively new and have little past activity. We show marked gains in predictive accuracy by various metrics.

19.
Given a general homogeneous non-stationary autoregressive integrated moving average process ARIMA(p,d,q), the corresponding model for the subseries obtained by systematic sampling is derived. The article then shows that the sampled subseries is approximately an integrated moving average process IMA(d,l), l ≤ d−1, regardless of the autoregressive and moving average structures in the original series. In particular, the sampled subseries from an ARIMA(p,1,q) process is approximately a simple random walk model.

20.
The accelerated failure time (AFT) models have proved useful in many contexts, though heavy censoring (as for example in cancer survival) and high dimensionality (as for example in microarray data) cause difficulties for model fitting and model selection. We propose new approaches to variable selection for censored data, based on AFT models optimized using regularized weighted least squares. The regularized technique uses a mixture of \(\ell _1\) and \(\ell _2\) norm penalties under two proposed elastic net type approaches: one is the adaptive elastic net and the other is the weighted elastic net. The approaches extend the original proposals of Ghosh (Adaptive elastic net: an improvement of elastic net to achieve oracle properties, Technical Report, 2007) and Hong and Zhang (Math Model Nat Phenom 5(3):115–133, 2010), respectively. We further extend the two proposed approaches by adding censored observations as constraints into their model optimization frameworks. The approaches are evaluated on microarray data and by simulation. We compare the performance of these approaches with six other variable selection techniques: three that are generally used for censored data and three correlation-based greedy methods used for high-dimensional data.
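To make the "regularized weighted least squares on an AFT model" idea concrete, the sketch below fits a plain elastic net to log observed times with inverse-probability-of-censoring (Stute-type) weights. The simulated data, the IPCW weighting scheme, and the tuning constants are assumptions; the adaptive and weighted elastic net variants and the censoring constraints proposed in the paper are not reproduced here.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
n, p = 200, 50

# Simulated high-dimensional AFT data (assumption): log T = X beta + noise,
# with independent right censoring.
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]                           # a few true signals
t = np.exp(X @ beta + rng.normal(scale=0.5, size=n))  # event times
c = np.exp(rng.normal(loc=1.0, scale=1.0, size=n))    # censoring times
obs_t = np.minimum(t, c)
delta = (t <= c).astype(float)                        # 1 = event observed

# Inverse-probability-of-censoring weights from a Kaplan-Meier fit to the
# censoring distribution; censored observations get weight zero.
kmf = KaplanMeierFitter()
kmf.fit(obs_t, event_observed=1 - delta)              # treat censoring as the "event"
G = kmf.survival_function_at_times(obs_t).values
w = delta / np.maximum(G, 1e-3)

# Elastic-net-penalized weighted least squares on the log observed times.
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, np.log(obs_t), sample_weight=w)
print("indices of nonzero coefficients:", np.flatnonzero(model.coef_))
```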
