1.

Simultaneous edit-imputation and disclosure limitation for business establishment data

Hang J. Kim Jerome P. Reiter Alan F. Karr 《Journal of applied statistics》2018,45(1):63-82

Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.
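To make the notion of edit rules concrete, the following is a minimal sketch of how one record might be checked against a balance equation and a linear inequality. The field names, the bounds and the `violated_edits` helper are hypothetical illustrations and are not taken from the article or from any agency's actual rules.

```python
from typing import Dict, List

# Hypothetical record layout: field names and bounds are illustrative only.
Record = Dict[str, float]


def violated_edits(rec: Record, tol: float = 1e-9) -> List[str]:
    """Return the names of the (assumed) edit rules that a record fails."""
    failures = []
    # Balance equation: component costs must add up to the reported total.
    if abs(rec["wages"] + rec["materials"] - rec["total_cost"]) > tol:
        failures.append("balance: wages + materials == total_cost")
    # Ratio edit written as linear inequalities, which avoids dividing
    # by a possibly zero employee count.
    if not (10.0 * rec["employees"] <= rec["wages"] <= 200.0 * rec["employees"]):
        failures.append("ratio: 10*employees <= wages <= 200*employees")
    # Range edit: every field must be nonnegative.
    if any(v < 0 for v in rec.values()):
        failures.append("range: all fields >= 0")
    return failures


clean = {"employees": 20, "wages": 1000.0, "materials": 500.0, "total_cost": 1500.0}
faulty = {"employees": 20, "wages": 1000.0, "materials": 500.0, "total_cost": 1400.0}

print(violated_edits(clean))
print(violated_edits(faulty))
```

A record with a non-empty failure list would be a candidate for edit-imputation; how the faulty fields are localized and imputed is the subject of the article and is not shown here.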

2.

Family violence: A microeconomic approach

Sharon K. Long D. Witte Patrice Karr 《Social science research》1983,12(4):363-392

A model of violence between adult family members is developed by integrating material from the sociological theories of family violence and social exchange, and the economic theories of crime and the family. Based on this model, a decrease in the dictator's internal sanctions against violence would be expected to increase the amount of time allocated to violence by the dictator. Further, if the level of fines and other monetary costs imposed by external agencies (e.g., the courts) as a result of the family violence does not vary with the level of violence, then the model indicates that an increase in such monetary sanctions will cause a reduction in the amount of time the dictator allocates to violence. If both the dictator and victim are risk neutral, an increase in the probability of external intervention will decrease the time allocated to violence. In addition, it is found that increases in the opportunities available to the victim outside the marriage will tend to improve the well-being of the victim in the marriage even if they have no effect on the time allocated to violence by the dictator. The model also provides insights for empirical work in family violence such as (1) suggestions of relevant independent variables, (2) the specification of a functional form for estimation, and (3) the specification of an error structure for the empirical model.

3.

Masking methods that preserve positivity constraints in microdata

Anna Oganian Alan F. Karr 《Journal of statistical planning and inference》2011,141(1):31-41

Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, in order to achieve these conflicting objectives, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but also may increase disclosure risk. In this paper, we describe a method for SDL that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with the proper, data-dependent covariance structure.
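As a rough illustration of why multiplicative noise keeps positive attributes positive, here is a deliberately simplified sketch. It is not the method of the paper: it uses independent lognormal factors with mean 1 (an assumption made here), so it preserves positivity and the expected value of each attribute but not the covariance matrix, which the paper handles through a data-dependent noise covariance. The function name `mask_multiplicative` and the parameter `sigma` are hypothetical.

```python
import math
import random
from typing import List


def mask_multiplicative(values: List[float], sigma: float,
                        rng: random.Random) -> List[float]:
    """Multiply each positive value by independent lognormal noise with mean 1.

    If log(e) ~ N(mu, sigma^2), then E[e] = exp(mu + sigma^2 / 2),
    so choosing mu = -sigma^2 / 2 gives E[e] = 1.
    """
    mu = -0.5 * sigma ** 2
    return [v * rng.lognormvariate(mu, sigma) for v in values]


rng = random.Random(2011)
original = [rng.uniform(1.0, 100.0) for _ in range(10_000)]
masked = mask_multiplicative(original, sigma=0.1, rng=rng)

print(min(masked) > 0)  # positivity is kept because every factor is positive
print(sum(original) / len(original), sum(masked) / len(masked))
```

Additive noise with the same variance could push small values below zero, which is the kind of constraint violation the abstract warns about.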

4.

Preserving data utility via BART

Xinlei Wang Alan F. Karr 《Journal of statistical planning and inference》2010

When preparing data for public release, information organizations face the challenge of preserving the quality of data while protecting the confidentiality of both data subjects and sensitive data attributes. Without knowing what type of analyses will be conducted by data users, it is often hard to alter data without sacrificing data utility. In this paper, we propose a new approach to mitigate this difficulty, which entails using Bayesian additive regression trees (BART), in connection with existing methods for statistical disclosure limitation, to help preserve data utility while meeting confidentiality requirements. We illustrate the performance of our method through both simulation and a data example. The method works well when the targeted relationship underlying the original data is not weak, and the performance appears to be robust to the intensity of alteration.

5.

Preserving confidentiality of high-dimensional tabulated data: Statistical and computational issues

Adrian Dobra Alan F. Karr Ashish P. Sanil 《Statistics and Computing》2003,13(4):363-370

Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
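For readers unfamiliar with iterative proportional fitting, a minimal two-way sketch follows. It is a textbook dense version written for illustration, assuming a small table held as nested lists; it does not use the sparse data structures the paper describes, and it does not implement the shuttle algorithm. The function name `ipf` and its parameters are choices made here.

```python
from typing import List


def ipf(seed: List[List[float]], row_targets: List[float],
        col_targets: List[float], max_iter: int = 100,
        tol: float = 1e-10) -> List[List[float]]:
    """Rescale a two-way table until its margins match the given targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Step 1: scale every row to its target row total.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                table[i] = [x * row_targets[i] / total for x in row]
        # Step 2: scale every column to its target column total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match; stop once the rows match as well.
        err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if err < tol:
            break
    return table


fitted = ipf([[1.0, 1.0], [1.0, 1.0]], row_targets=[30.0, 70.0],
             col_targets=[40.0, 60.0])
print(fitted)
```

With a uniform seed, the fitted table is the independence table implied by the two margins; a non-uniform seed would carry its interaction structure into the result.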

6.

Data quality: A statistical perspective (cited by: 1; self-citations: 0; citations by others: 1)

Alan F. Karr Ashish P. Sanil David L. Banks 《Statistical Methodology》2006,3(2):137-173

We present the old-but-new problem of data quality from a statistical perspective, in part with the goal of attracting more statisticians, especially academics, to become engaged in research on a rich set of exciting challenges. The data quality landscape is described, and its research foundations in computer science, total quality management and statistics are reviewed. Two case studies based on an EDA approach to data quality are used to motivate a set of research challenges for statistics that span theory, methodology and software tools.

7.

Multiple Imputation of Missing or Faulty Values Under Linear Constraints

Hang J. Kim Jerome P. Reiter Quanli Wang Lawrence H. Cox Alan F. Karr 《Journal of Business & Economic Statistics》2014,32(3):375-386

Many statistical agencies, survey organizations, and research centers collect data that suffer from item nonresponse and erroneous or inconsistent values. These data may be required to satisfy linear constraints, for example, bounds on individual variables and inequalities for ratios or sums of variables. Often these constraints are designed to identify faulty values, which then are blanked and imputed. The data also may exhibit complex distributional features, including nonlinear relationships and highly nonnormal distributions. We present a fully Bayesian, joint model for modeling or imputing data with missing/blanked values under linear constraints that (i) automatically incorporates the constraints in inferences and imputations, and (ii) uses a flexible Dirichlet process mixture of multivariate normal distributions to reflect complex distributional features. Our strategy for estimation is to augment the observed data with draws from a hypothetical population in which the constraints are not present, thereby taking advantage of computationally expedient methods for fitting mixture models. Missing/blanked items are sampled from their posterior distribution using the Hit-and-Run sampler, which guarantees that all imputations satisfy the constraints. We illustrate the approach using manufacturing data from Colombia, examining the potential to preserve joint distributions and a regression from the plant productivity literature. Supplementary materials for this article are available online.
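The reason a Hit-and-Run sampler never leaves the feasible region can be seen in a small sketch. The version below targets the uniform distribution on a bounded region {x : A x <= b}, which is an assumption made here for simplicity; the paper samples from a posterior under a mixture model, which would replace the uniform draw along the line with a draw from the appropriate one-dimensional conditional. The constraint matrix, the starting point and the function name `hit_and_run` are illustrative.

```python
import math
import random
from typing import List


def hit_and_run(A: List[List[float]], b: List[float], x0: List[float],
                n_steps: int, rng: random.Random) -> List[List[float]]:
    """Uniform Hit-and-Run on the bounded region {x : A x <= b}.

    x0 must lie strictly inside the region.
    """
    x = list(x0)
    samples = []
    for _ in range(n_steps):
        # Draw a direction uniformly on the unit sphere.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
        # Find the segment lo <= t <= hi on which A (x + t u) <= b holds.
        lo, hi = -math.inf, math.inf
        for row, bi in zip(A, b):
            au = sum(r * v for r, v in zip(row, u))
            slack = bi - sum(r * v for r, v in zip(row, x))
            if au > 1e-12:
                hi = min(hi, slack / au)
            elif au < -1e-12:
                lo = max(lo, slack / au)
        # Move to a uniformly chosen point on that feasible segment.
        t = rng.uniform(lo, hi)
        x = [xi + t * ui for xi, ui in zip(x, u)]
        samples.append(x)
    return samples


# Illustrative constraints: x1 >= 0, x2 >= 0, x1 + x2 <= 10, x1 <= 3 * x2.
A = [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0], [1.0, -3.0]]
b = [0.0, 0.0, 10.0, 0.0]
draws = hit_and_run(A, b, x0=[2.0, 2.0], n_steps=1000, rng=random.Random(0))
print(draws[-1])
```

Because each move is restricted to the feasible segment through the current point, every draw satisfies the bound, sum and ratio constraints without any rejection step.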