Similar Documents
20 similar documents retrieved.
1.
Small area statistics obtained from sample survey data provide a critical source of information used to study health, economic, and sociological trends. However, most large-scale sample surveys are not designed for the purpose of producing small area statistics. Moreover, data disseminators are prevented from releasing public-use microdata for small geographic areas for disclosure reasons, thus limiting the utility of the data they collect. This research evaluates a synthetic data method, intended for data disseminators, for releasing public-use microdata for small geographic areas based on complex sample survey data. The method replaces all observed survey values with synthetic (or imputed) values generated from a hierarchical Bayesian model that explicitly accounts for complex sample design features, including stratification, clustering, and sampling weights. The method is applied to restricted microdata from the National Health Interview Survey, and synthetic data are generated for both sampled and non-sampled small areas. The analytic validity of the resulting small area inferences is assessed by direct comparison with the actual data, a simulation study, and a cross-validation study.

2.
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values (e.g. sensitive values at high risk of disclosure or values of key identifiers) replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.
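A minimal sketch of the standard combining rules that such inferences build on, here for a scalar estimand computed on m partially synthetic data sets; the function name and the toy numbers are illustrative assumptions, and the multi-component Wald and likelihood ratio procedures of the article are not reproduced.

    import numpy as np

    def combine_partially_synthetic(estimates, variances):
        """Combine point estimates q_l and variance estimates u_l computed on each
        of m partially synthetic data sets (scalar-estimand combining rules)."""
        q = np.asarray(estimates, dtype=float)
        u = np.asarray(variances, dtype=float)
        m = q.size
        q_bar = q.mean()              # combined point estimate
        b = q.var(ddof=1)             # between-synthesis variance
        u_bar = u.mean()              # average within-synthesis variance
        t = u_bar + b / m             # total variance for partially synthetic data
        return q_bar, t

    # toy usage: estimates of a mean from m = 5 synthetic copies
    rng = np.random.default_rng(0)
    print(combine_partially_synthetic(rng.normal(10.0, 0.2, 5), np.full(5, 0.04)))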

3.
In the area of statistical disclosure limitation, releasing synthetic data sets has become a popular method for limiting the risk of disclosing sensitive information while maintaining the analytic utility of the data. However, less work has been done on how to create synthetic contingency tables that preserve some summary statistics of the original table. Studies in this area have primarily focused on generating replacement tables that preserve the margins of the original table, since the latter support statistical inferences for a large set of parametric tests and models. Yet not all synthetic tables that preserve a set of margins yield consistent results. In this paper, we propose alternative synthetic table releases. We describe how to generate complete two-way contingency tables that have the same set of observed conditional frequencies by using tools from computational algebra. We study both the disclosure risk and the data utility associated with such synthetic tabular data releases, and compare them to traditionally released synthetic tables.

4.
Summary: One specific problem statistical offices and research institutes face when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, and information loss is potentially high. In this paper an alternative technique for creating scientific-use files is discussed, which reproduces the characteristics of the original data quite well. It is based on Fienberg (1997, 1994), who estimates and resamples from the empirical multivariate cumulative distribution function of the data in order to obtain synthetic data. The procedure creates data sets (the resample) that have the same characteristics as the original survey data. The paper includes some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), as well as a comparison between resampling and a common method of disclosure control (disturbance with multiplicative error) with regard to confidentiality on the one hand and the suitability of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be reproduced better by unweighted resampling. Parameter estimates can be reproduced quite well if the resampling procedure incorporates the correlation structure of the original data as a scale or if the data are multiplicatively perturbed and a correction term is used. On average, anonymization of data with multiplicatively perturbed values protects better against re-identification than the various resampling methods used.
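For concreteness, a small sketch of a multiplicative-error disturbance of the kind used as the benchmark above, together with one simple moment correction for the inflated variance; the lognormal noise, its size, and the correction formula are illustrative assumptions rather than the implementation applied to the MIP data.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.0, scale=5.0, size=1000)     # stand-in for a positive survey variable

    sigma = 0.1
    # multiplicative lognormal noise with E[e] = 1, so the perturbed values stay unbiased for x
    e = rng.lognormal(mean=-0.5 * sigma**2, sigma=sigma, size=x.size)
    z = x * e                                          # released, perturbed values

    # simple moment correction: Var(x) = E[z^2] / (1 + Var(e)) - E[z]^2
    var_e = np.exp(sigma**2) - 1.0
    var_x_hat = np.mean(z**2) / (1.0 + var_e) - np.mean(z)**2
    print(var_x_hat, x.var())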

5.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted-access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.

6.
In this article, we propose some new generalizations of M-estimation procedures for single-index regression models in the presence of randomly right-censored responses. We derive the consistency and asymptotic normality of our estimates. The results are proved in a form that adapts to a wide range of techniques used in the censored regression framework (e.g. synthetic data or weighted least squares). As in the uncensored case, the estimator of the single-index parameter is seen to have the same asymptotic behavior as in a fully parametric scheme. We compare these new estimators with those based on the average derivative technique of Lu and Burke [2005. Censored multiple regression by the method of average derivatives. J. Multivariate Anal. 95, 182–205] through a simulation study.

7.
Abstract. The problem of estimating a nonlinear regression model, when the dependent variable is randomly censored, is considered. The parameter of the model is estimated by least squares using synthetic data. Consistency and asymptotic normality of the least squares estimators are derived. The proofs are based on a novel approach that uses i.i.d. representations of synthetic data through Kaplan–Meier integrals. The asymptotic results are supported by a small simulation study.
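One standard way to build such synthetic responses is the Koul–Susarla–Van Ryzin transformation, which rescales uncensored observations by a Kaplan–Meier estimate of the censoring distribution and then applies ordinary (nonlinear) least squares. The sketch below assumes no ties among observed times and an exponential toy model; it illustrates the general idea rather than the paper's exact estimator.

    import numpy as np
    from scipy.optimize import curve_fit

    def km_censoring_survival(z, delta):
        """Kaplan-Meier estimate of P(C >= z_i) for the censoring variable C,
        where z = min(Y, C) and delta = 1{Y <= C} (no ties assumed)."""
        order = np.argsort(z)
        cens = (1 - delta)[order]                               # censoring acts as the 'event'
        n = z.size
        at_risk = n - np.arange(n)
        surv_after = np.cumprod(1.0 - cens / at_risk)           # P(C > z_(i))
        surv_before = np.concatenate(([1.0], surv_after[:-1]))  # P(C >= z_(i))
        out = np.empty(n)
        out[order] = surv_before
        return out

    def synthetic_response(z, delta):
        return delta * z / km_censoring_survival(z, delta)      # Koul-Susarla-Van Ryzin

    # toy data: Y = 2 exp(0.7 x) + noise, independent exponential censoring
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, 500)
    y = 2.0 * np.exp(0.7 * x) + rng.normal(0, 0.3, 500)
    c = rng.exponential(8.0, 500)
    z, delta = np.minimum(y, c), (y <= c).astype(float)

    def model(x, a, b):
        return a * np.exp(b * x)

    theta_hat, _ = curve_fit(model, x, synthetic_response(z, delta), p0=(1.0, 0.5))
    print(theta_hat)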

8.
This study develops a robust automatic algorithm for clustering probability density functions, building on previous research. Unlike other existing methods, which often pre-determine the number of clusters, this method can self-organize data groups based on the original data structure. The proposed clustering method is also robust to noise. Three examples with synthetic data and a real-world COREL dataset are used to illustrate the accuracy and effectiveness of the proposed approach.

9.
If unit-level data are available, small area estimation (SAE) is usually based on models formulated at the unit level, but they are ultimately used to produce estimates at the area level and thus involve area-level inferences. This paper investigates the circumstances under which using an area-level model may be more effective. Linear mixed models (LMMs) fitted using different levels of data are applied in SAE to calculate synthetic estimators and empirical best linear unbiased predictors (EBLUPs). The performance of area-level models is compared with unit-level models when both individual and aggregate data are available. A key factor is whether there are substantial contextual effects. Ignoring these effects in unit-level working models can cause biased estimates of regression parameters. The contextual effects can be automatically accounted for in the area-level models. Using synthetic and EBLUP techniques, small area estimates based on different levels of LMMs are investigated in this paper by means of a simulation study.
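As a concrete point of reference for the area-level route, here is a minimal sketch of the Fay–Herriot model with its regression-synthetic estimator and EBLUP; the moment-based estimation of the area-effect variance and the toy data are assumptions for illustration, not the simulation design of the paper.

    import numpy as np

    def fay_herriot_eblup(y, X, D, n_iter=50):
        """Area-level EBLUP under the Fay-Herriot model y_i = x_i'beta + v_i + e_i
        with known sampling variances D_i; the variance of the area effects v_i is
        estimated by a simple Fisher-scoring-type moment iteration."""
        m, p = X.shape
        sigma_v2 = max(np.var(y) - np.mean(D), 0.01)          # crude starting value
        for _ in range(n_iter):
            w = 1.0 / (sigma_v2 + D)                          # GLS weights
            beta = np.linalg.solve((X.T * w) @ X, (X.T * w) @ y)
            resid = y - X @ beta
            # moment equation: sum_i resid_i^2 / (sigma_v2 + D_i) = m - p
            sigma_v2 = max(sigma_v2 + (np.sum(w * resid**2) - (m - p)) / np.sum(w), 0.0)
        gamma = sigma_v2 / (sigma_v2 + D)                     # shrinkage factors
        synthetic = X @ beta                                  # regression-synthetic estimates
        return gamma * y + (1.0 - gamma) * synthetic, synthetic, beta, sigma_v2

    # toy usage: 15 areas, an intercept and one area-level covariate
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(15), rng.uniform(0, 1, 15)])
    D = rng.uniform(0.05, 0.2, 15)                            # known sampling variances
    y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, 15) + rng.normal(0, np.sqrt(D))
    print(fay_herriot_eblup(y, X, D)[0])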

10.
Quality control charts have been widely recognized as a potentially powerful statistical process monitoring tool in statistical process control because of their superior ability to detect shifts in the process parameters. Recently, auxiliary-information-based control charts have been proposed and shown to detect process shifts faster than charts that do not use such information. In this paper, we design a new synthetic control chart based on a statistic that utilizes information from both the study and auxiliary variables. The proposed synthetic chart encompasses the classical synthetic chart. The construction, optimal design, run length profiles, and performance evaluation of the new chart are discussed in detail. It turns out that the proposed synthetic chart performs uniformly better than the classical synthetic chart in detecting different kinds of shifts in the process mean, in terms of both zero-state and steady-state run length performance. Moreover, under reasonable assumptions, the proposed chart also surpasses the exponentially weighted moving average control chart. An application with a simulated data set is also presented to explain the implementation of the proposed control chart.
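To make the baseline concrete, below is a minimal sketch of the classical synthetic chart (a Shewhart X-bar sub-chart combined with a conforming-run-length rule); the auxiliary-information-based statistic of the proposed chart is not implemented, and counting the first run from the start of monitoring is just one common convention.

    import numpy as np

    def synthetic_chart_signals(xbar, mu0, sigma0, n, k=3.0, L=4):
        """Classical synthetic chart: a sample mean is nonconforming if it falls
        outside mu0 +/- k*sigma0/sqrt(n); the chart signals when the conforming run
        length between successive nonconforming samples is <= L. The run length for
        the first nonconforming sample is counted from the start of monitoring."""
        limit = k * sigma0 / np.sqrt(n)
        nonconforming = np.abs(np.asarray(xbar, dtype=float) - mu0) > limit
        signals, last_nc = [], -1
        for t, nc in enumerate(nonconforming):
            if nc:
                if t - last_nc <= L:          # conforming run length, current sample included
                    signals.append(t)
                last_nc = t
        return signals

    # example: means of samples of size n = 5; the process mean shifts upward at t = 30
    rng = np.random.default_rng(2)
    xbar = np.concatenate([rng.normal(0.0, 1 / np.sqrt(5), 30),
                           rng.normal(0.8, 1 / np.sqrt(5), 30)])
    print(synthetic_chart_signals(xbar, mu0=0.0, sigma0=1.0, n=5, k=2.0, L=5))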

11.
In this paper we introduce a new method for detecting outliers in a set of proportions. It is based on the construction of a suitable two-way contingency table and on the application of an algorithm for detecting outlying cells in such a table. We exploit the special structure of the relevant contingency table to increase the efficiency of the method. The main properties of our algorithm, together with a guide for the choice of the parameters, are investigated through simulations, and in simple cases some theoretical justifications are provided. Several examples on synthetic data and an example based on pseudo-real data from biological experiments demonstrate the good performance of our algorithm.
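The specific cell-detection algorithm is not spelled out in the abstract; as a generic stand-in, the sketch below flags cells of a two-way table whose adjusted Pearson residuals under independence are large. It is meant only to illustrate what "outlying cells" means, not to reproduce the authors' method.

    import numpy as np

    def outlying_cells(table, threshold=3.0):
        """Flag cells of a two-way contingency table whose adjusted Pearson
        residuals under independence exceed a threshold in absolute value."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        r = table.sum(axis=1, keepdims=True) / n          # row proportions
        c = table.sum(axis=0, keepdims=True) / n          # column proportions
        expected = n * r * c
        resid = (table - expected) / np.sqrt(expected * (1 - r) * (1 - c))
        return np.abs(resid) > threshold, resid

    counts = np.array([[30, 28, 31], [29, 60, 27], [33, 30, 29]])   # one inflated cell
    flags, resid = outlying_cells(counts)
    print(flags)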

12.
We propose two preprocessing algorithms suitable for climate time series. The first algorithm detects outliers based on an autoregressive cost update mechanism. The second is based on the wavelet transform, a method from pattern recognition. In order to benchmark the algorithms' performance, we compare them to existing methods on a synthetic data set. Finally, for illustrative purposes, the proposed methods are applied to a data set of high-frequency temperature measurements from Novi Sad, Serbia. The results show that both methods together form a powerful tool for signal preprocessing: in the case of solitary outliers the autoregressive cost update mechanism prevails, whereas the wavelet-based mechanism is the method of choice in the presence of multiple consecutive outliers.
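As a rough illustration of the autoregressive idea (not the paper's cost-update rule), the sketch below flags points whose one-step AR(1) prediction error is large relative to a robust scale, refitting the AR coefficient on the points currently judged clean; the threshold and the number of passes are arbitrary choices.

    import numpy as np

    def ar1_outliers(x, threshold=4.0, passes=3):
        """Flag points whose one-step AR(1) prediction error is large relative to a
        robust (MAD) scale; the AR coefficient is refitted on points currently
        judged clean. Suited mainly to solitary outliers."""
        x = np.asarray(x, dtype=float)
        clean = np.ones(x.size, dtype=bool)
        for _ in range(passes):
            pair = clean[:-1] & clean[1:]                     # clean lag-1 pairs
            mu = x[clean].mean()
            phi = np.corrcoef(x[:-1][pair], x[1:][pair])[0, 1]
            pred = np.concatenate(([x[0]], mu + phi * (x[:-1] - mu)))
            err = x - pred
            mad = 1.4826 * np.median(np.abs(err - np.median(err)))
            clean = np.abs(err - np.median(err)) <= threshold * mad
            clean[0] = True                                   # first point has no prediction
        return ~clean

    # toy series: smooth seasonal signal plus noise, with two injected spikes
    rng = np.random.default_rng(0)
    t = np.arange(500)
    x = 10 + 5 * np.sin(2 * np.pi * t / 100) + rng.normal(0, 0.5, t.size)
    x[[120, 340]] += 8.0
    print(np.where(ar1_outliers(x))[0])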

13.
Calibration techniques in survey sampling, such as generalized regression estimation (GREG), were formalized in the 1990s to produce efficient estimators of linear combinations of study variables, such as totals or means. They implicitly rely on the assumption of a linear regression model between the variable of interest and some auxiliary variables, yielding estimates with lower variance if the model is true while remaining approximately design-unbiased even if the model does not hold. We propose a new class of model-assisted estimators obtained by relaxing a few calibration constraints and replacing them with a penalty term. This penalty is added to the distance criterion to be minimized. By introducing the concept of penalized calibration, combining usual calibration and this 'relaxed' calibration, we are able to adjust the weight given to the available auxiliary information. We obtain a more flexible estimation procedure that gives better estimates, particularly when the auxiliary information is overly abundant or not fully appropriate to be used completely. Such an approach can also be seen as a design-based alternative to estimation procedures based on the more general class of mixed models, opening new prospects in areas of application such as inference on small domains.
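For orientation, a minimal sketch of chi-square-distance calibration (the GREG weights), with an optional diagonal term that softens selected constraints ridge-style; the latter is only one simple reading of the penalized-calibration idea and is not the paper's exact criterion.

    import numpy as np

    def calibration_weights(d, X, totals, relax=None):
        """Chi-square-distance calibration: w_i = d_i * (1 + x_i'lam), with lam chosen
        so that sum_i w_i x_i = totals. A diagonal 'relax' vector softens the
        corresponding constraints ridge-style (zero entries keep them exact)."""
        d = np.asarray(d, dtype=float)
        X = np.asarray(X, dtype=float)
        A = X.T @ (d[:, None] * X)
        if relax is not None:
            A = A + np.diag(np.asarray(relax, dtype=float))
        lam = np.linalg.solve(A, np.asarray(totals, dtype=float) - X.T @ d)
        return d * (1.0 + X @ lam)

    # toy usage: 200 units, two auxiliary variables with known population totals
    rng = np.random.default_rng(0)
    Xaux = rng.uniform(1, 3, size=(200, 2))
    d = np.full(200, 5.0)                                     # design weights
    totals = np.array([2000.0, 2050.0])                       # known auxiliary totals
    w = calibration_weights(d, Xaux, totals)
    print(Xaux.T @ w)                                         # reproduces the totals exactly
    # a GREG-type estimate of a study-variable total is then np.sum(w * y)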

14.
Research on a Model for Predicting Box Office Revenue After a Film's Opening Day
A one-sample t-test on cross-sectional data for 285 films shows that every film grossing over 100 million yuan earned at least 2 million yuan on its opening day. Twenty-one days of data for 138 films with opening-day revenue above 2 million yuan are then used to build a dynamic panel model, and a box office prediction model is estimated by two-step system GMM. The results show that the previous day's box office has a significant positive effect on the next day's box office; that is, within a given period, an increase in the previous day's revenue raises the next day's revenue. Both relatively high and relatively low ticket prices have a positive effect on box office revenue, so ticket price is included in the prediction model. Film genre, release date, release window, country of production, sequel status, online ratings, and preview screenings are all shown to have positive effects on box office revenue. Compared with other studies, the prediction model that includes these indicators achieves substantially higher accuracy.

15.
Summary. There is a large literature on methods of analysis for randomized trials with noncompliance which focuses on the effect of treatment on the average outcome. The paper considers evaluating the effect of treatment on the entire distribution and on general functions of this effect. For distributional treatment effects, fully non-parametric and fully parametric approaches have been proposed. The fully non-parametric approach can be inefficient, whereas the fully parametric approach is not robust to violations of the distributional assumptions. We develop a semiparametric instrumental variable method based on the empirical likelihood approach. Our method can be applied to general outcomes and general functions of outcome distributions, and it allows us to predict a subject's latent compliance class on the basis of an observed outcome value in the observed assignment and treatment-received groups. Asymptotic results for the estimators and the likelihood ratio statistic are derived. A simulation study shows that our estimators of various treatment effects are substantially more efficient than the currently used fully non-parametric estimators. The method is illustrated by an analysis of data from a randomized trial of an encouragement intervention to improve adherence to prescribed depression treatments among depressed elderly patients in primary care practices.

16.
Research on Methods for the Comprehensive Evaluation of Teachers' Classroom Teaching Quality
陈正 (Chen Zheng), 《统计教育》 (Statistics Education), 2004, (6): 51-53
Comprehensive evaluation of teachers' classroom teaching quality is based on an indicator system for classroom teaching quality; the main comprehensive evaluation methods are the indicator weighted-average method and fuzzy comprehensive evaluation. The difficulties lie in designing and quantifying the evaluation indicators and in constructing the weight set. The evaluation results can be analyzed through absolute evaluation (grade judgments), relative evaluation, and dynamic evaluation (evaluation of individual differences).
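A minimal sketch of the fuzzy comprehensive evaluation step mentioned above, using made-up weights, a made-up membership matrix, and the weighted-average operator; the grade scores used to turn the result into a single mark are likewise illustrative.

    import numpy as np

    # four indicators with weights W (summing to 1) and a membership matrix R whose
    # rows give the share of raters assigning each indicator to the grades
    # (excellent, good, fair, poor)
    W = np.array([0.3, 0.3, 0.2, 0.2])
    R = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.4, 0.4, 0.1, 0.1],
                  [0.3, 0.4, 0.2, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
    B = W @ R                                   # weighted-average composition operator
    grade_scores = np.array([95, 85, 70, 55])   # scores used to collapse B into one mark
    print(B, B @ grade_scores)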

17.
In this article, we model functional magnetic resonance imaging (fMRI) data from an event-related experiment using a fourth-degree spline to fit voxel-specific blood oxygenation level-dependent (BOLD) responses. The data are preprocessed to remove long-term temporal components, such as drifts, using wavelet approximations. Spatial dependence is incorporated by applying a 3D Gaussian spatial filter. The methodology assigns an activation score to each trial based on the voxel-specific characteristics of the response curve. The proposed procedure can be fully automated and produces activation images based on the overall scores assigned to each voxel. The methodology is illustrated on real data from an event-related design experiment of visually guided saccades (VGS).
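Two of the ingredients described above are easy to sketch with standard scientific-Python tools; the array sizes, smoothing parameters, and the toy BOLD curve below are assumptions for illustration only, not the paper's settings.

    import numpy as np
    from scipy.interpolate import UnivariateSpline
    from scipy.ndimage import gaussian_filter

    rng = np.random.default_rng(0)

    # (1) spatial smoothing of one stand-in fMRI volume with a 3D Gaussian filter
    volume = rng.normal(size=(32, 32, 20))
    smoothed = gaussian_filter(volume, sigma=1.5)

    # (2) a fourth-degree smoothing spline fitted to one voxel's trial response
    t = np.arange(20, dtype=float)                               # scan times within a trial
    bold = np.exp(-(t - 6.0) ** 2 / 8.0) + rng.normal(0, 0.1, t.size)
    spline = UnivariateSpline(t, bold, k=4, s=0.1)
    fitted = spline(t)
    print(fitted.round(2))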

18.
Summary. Meteorological and environmental data that are collected at regular time intervals on a fixed monitoring network can be usefully studied by combining ideas from multiple time series and spatial statistics, particularly when there are little or no missing data. This work investigates methods for modelling such data and ways of approximating the associated likelihood functions. Models for processes on the sphere crossed with time are emphasized, especially models that are not fully symmetric in space-time. Two approaches to obtaining such models are described. The first is to consider a rotated version of fully symmetric models for which we have explicit expressions for the covariance function. The second is based on a representation of space-time covariance functions that is spectral in just the time domain and is shown to lead to natural partially nonparametric asymmetric models on the sphere crossed with time. Various models are applied to a data set of daily winds at 11 sites in Ireland over 18 years. Spectral and space-time domain diagnostic procedures are used to assess the quality of the fits. The spectral-in-time modelling approach is shown to yield a good fit to many properties of the data and can be applied in a routine fashion, in contrast to the effort of finding elaborate parametric models that describe the space-time dependences of the data about as well.

19.
Stochastic modeling of the geology in petroleum reservoirs has become an important tool for investigating flow properties in the reservoir. The stochastic models used contain parameters which must be estimated from observations and geological knowledge. The amount of data available is, however, quite limited owing to high drilling costs, and the lack of data prevents the use of many of the standard data-driven approaches to the parameter estimation problem. Modern simulation-based methods using Markov chain Monte Carlo can, however, be used to carry out fully Bayesian analysis of the parameters in the reservoir model, with the drawback of relatively high computational costs. In this paper, we propose a simple, relatively fast approximate method for fully Bayesian analysis of the parameters. We illustrate the method on both simulated and real data using a two-dimensional marked point process model for reservoir characterization.

20.
Before releasing survey data, statistical agencies usually perturb the original data to keep each survey unit's information confidential. One significant concern in releasing survey microdata is identity disclosure, which occurs when an intruder correctly identifies the records of a survey unit by matching the values of some key (or pseudo-identifying) variables. We examine a recently developed post-randomization method for strict control of identification risks in releasing survey microdata. While that procedure preserves the observed frequencies, and hence statistical estimates, well in the case of simple random sampling, we show that in general surveys it may induce considerable bias in commonly used survey-weighted estimators. We propose a modified procedure that better preserves weighted estimates. The procedure is illustrated and empirically assessed with an application to a publicly available US Census Bureau data set.
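For readers unfamiliar with post-randomization (PRAM), the sketch below applies a fixed transition matrix to a single categorical key variable; the matrix, the toy data, and the function name are illustrative, and neither the strict identification-risk control nor the weight-preserving modification discussed above is reproduced.

    import numpy as np

    def pram(categories, transition, seed=0):
        """Post-randomization: replace a record's category k by category j with
        probability transition[k, j]."""
        rng = np.random.default_rng(seed)
        categories = np.asarray(categories)
        out = np.empty_like(categories)
        for k in range(transition.shape[0]):
            idx = np.where(categories == k)[0]
            out[idx] = rng.choice(transition.shape[0], size=idx.size, p=transition[k])
        return out

    # toy usage: three categories of a key variable, 90% chance of keeping the original value
    P = np.full((3, 3), 0.05) + 0.85 * np.eye(3)
    x = np.random.default_rng(1).integers(0, 3, size=1000)
    x_pram = pram(x, P)
    print(np.mean(x == x_pram))                 # roughly 0.9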
