Similar literature
A total of 20 similar documents were retrieved.
1.
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as those of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.
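As a rough illustration of this principle, the sketch below (not the paper's procedure, just one simple instance of it) assumes the conditional distribution of a confidential numerical variable given the non-confidential variables can be approximated by a normal linear model; perturbed values are then drawn independently from that estimated conditional distribution, so their relationship to the non-confidential variables mimics that of the original values. All variable names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_conditional(y, X):
    """Replace a confidential variable y by independent draws from an
    estimated conditional distribution of y given non-confidential X
    (here approximated by a normal linear model, one simple choice)."""
    X1 = np.column_stack([np.ones(len(y)), X])         # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)      # fit y ~ X by least squares
    resid = y - X1 @ beta
    sigma = resid.std(ddof=X1.shape[1])                # residual standard deviation
    return X1 @ beta + rng.normal(0.0, sigma, len(y))  # independent draws of y* | X

# hypothetical example: confidential income given two non-confidential variables
X = rng.normal(size=(200, 2))
income = 30.0 + X @ np.array([5.0, -2.0]) + rng.normal(scale=3.0, size=200)
income_perturbed = perturb_conditional(income, X)
```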

2.
Empirical research cannot do without data. Official aggregate data have increasingly become a public good that research communities and the general public can obtain through many channels. However, owing to technical, economic, legal and even political constraints, channels for sharing and disseminating statistical microdata are lacking, forcing research groups and individuals to collect data themselves, which results in a great deal of duplicated effort and wasted money and time. At the same time, existing statistical microdata are under-exploited, which lowers the return on data collection and severely constrains the improvement of statistical capacity. This paper compares the current state of microdata release in China and abroad, discusses the utility and risks of releasing microdata, and argues that the crucial problem is the tension between growing demand for data and the risk of statistical disclosure. It then reviews the disclosure risk control methods commonly used internationally and, in light of China's circumstances, offers recommendations for microdata release in China.

3.
When tables are generated from a data file, the release of those tables should not reveal overly detailed information concerning individual respondents. The disclosure of individual respondents in the microdata file can be prevented by applying disclosure control methods at the table level (by cell suppression or cell perturbation), but this may create inconsistencies among other tables based on the same data file. Alternatively, disclosure control methods can be applied at the microdata level, but these methods may change the data permanently and do not account for specific table properties. These problems can be circumvented by assigning a (single and fixed) weight factor to each respondent/record in the microdata file. Normally this weight factor is equal to 1 for each record, and is not explicitly incorporated in the microdata file. Upon tabulation, each contribution of a respondent is weighted multiplicatively by the respondent's weight factor. This approach is called Source Data Perturbation (SDP) because the data is perturbed at the microdata level, not at the table level. It should be noted, however, that the data in the original microdata is not changed; only a weight variable is added. The weight factors can be chosen in accordance with the SDC paradigm, i.e. such that the tables generated from the microdata are safe, and the information loss is minimized. The paper indicates how this can be done. Moreover, it is shown that the SDP approach is very suitable for use in data warehouses, as the weights can be conveniently put in the fact tables. The data can then still be accessed and sliced and diced up to a certain level of detail, and tables generated from the data warehouse are mutually consistent and safe.
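A minimal sketch of how such a weight variable could be used at tabulation time is given below. The weights here are drawn arbitrarily for illustration; in the SDP approach they would be chosen so that every table generated from the file is safe while information loss is minimized. All names and figures are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# hypothetical microdata file; SDP only adds a weight variable,
# the original values themselves are never changed
micro = pd.DataFrame({
    "region":   rng.choice(["North", "South"], size=500),
    "industry": rng.choice(["A", "B", "C"], size=500),
    "turnover": rng.gamma(shape=2.0, scale=50.0, size=500),
})
# weight factors near 1; random here, whereas SDP would pick them so that
# all published cells are safe
micro["weight"] = rng.normal(loc=1.0, scale=0.05, size=len(micro))

# upon tabulation each contribution is weighted multiplicatively, so every
# table built from the same file (or fact table) is mutually consistent
table = (micro.assign(w_turnover=micro["turnover"] * micro["weight"])
              .groupby(["region", "industry"])["w_turnover"]
              .sum())
print(table.round(1))
```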

4.
Summary. Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure has at least two advantages. First, we suggest that it may be a more realistic measure of risk than two measures that are currently used with census data. Second, we show that consistent inference (in a specified sense) may be made about this measure from sample data without strong modelling assumptions. This is a surprising finding, in its contrast with the properties of the two 'similar' established measures. As a result, this measure has potentially useful applications to sample surveys. In addition to obtaining a simple consistent predictor of the measure, we propose a simple variance estimator and show that it is consistent. We also consider the extension of inference to allow for certain complex sampling schemes. We present a numerical study based on 1991 census data for about 450 000 enumerated individuals in one area of Great Britain. We show that the theoretical results on the properties of the point predictor of the measure of risk and its variance estimator hold to a good approximation for these data.
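For intuition about the kind of quantity involved, the toy calculation below (not the authors' definition or estimator, and with the population key frequencies assumed known, which is exactly what is unavailable in practice) looks at records that are unique in the sample on a set of key variables: matching such a record at random to one of the F population units sharing its key values is correct with probability 1/F, and these per-record probabilities are averaged. The paper is concerned with making consistent inference about a match-correctness measure from the sample alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# hypothetical population and a simple random sample from it
population = pd.DataFrame({
    "age_band": rng.integers(0, 15, size=20000),
    "sex":      rng.integers(0, 2, size=20000),
    "region":   rng.integers(0, 10, size=20000),
})
sample = population.sample(n=2000, random_state=3)

keys = ["age_band", "sex", "region"]
pop_counts = population.groupby(keys).size().rename("F")   # population frequencies
smp_counts = sample.groupby(keys).size().rename("f")       # sample frequencies
cells = pd.concat([pop_counts, smp_counts], axis=1).fillna(0)

# for a record that is unique in the sample on the keys, a random match to
# one of the F population units sharing the key is correct with probability 1/F
uniques = cells[cells["f"] == 1]
risk = (1.0 / uniques["F"]).mean()
print(f"average probability that a match is correct: {risk:.3f}")
```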

5.
To protect public-use microdata, one approach is not to allow users access to the microdata. Instead, users submit analyses to a remote computer that reports back basic output from the fitted model, such as coefficients and standard errors. To be most useful, this remote server also should provide some way for users to check the fit of their models, without disclosing actual data values. This paper discusses regression diagnostics for remote servers. The proposal is to release synthetic diagnostics, i.e. simulated values of residuals and of the dependent and independent variables, constructed to mimic the relationships among the real-data residuals and independent variables. Using simulations, it is shown that the proposed synthetic diagnostics can reveal model inadequacies without a substantial increase in the risk of disclosure. This approach also can be used to develop remote server diagnostics for generalized linear models.
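The following sketch conveys the flavour of synthetic diagnostics without reproducing the paper's construction: synthetic predictor values are simulated, and each is given the residual of its nearest real observation plus a little noise, so that systematic patterns in the real residuals, such as the curvature left by an omitted quadratic term, carry over to the synthetic residual plot without releasing any real record. Everything here is a hypothetical stand-in for the model-based construction described in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical real data; the linear fit is deliberately misspecified
x = rng.uniform(-2, 2, size=300)
y = 1.0 + 2.0 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=300)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta                         # real residuals show a curved pattern

# crude synthetic diagnostics: simulate predictor values, attach the residual
# of the nearest real observation, and add small extra noise
x_syn = rng.normal(loc=x.mean(), scale=x.std(), size=300)
nearest = np.abs(x_syn[:, None] - x[None, :]).argmin(axis=1)
resid_syn = resid[nearest] + rng.normal(scale=0.1 * resid.std(), size=300)

# a plot of resid_syn against x_syn would show the same curvature as the
# real diagnostics, flagging the omitted quadratic term
print(np.corrcoef(x_syn**2, resid_syn)[0, 1])
```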

6.
Summary. The paper establishes a correspondence between statistical disclosure control and forensic statistics regarding their common use of the concept of 'probability of identification'. The paper then seeks to investigate what lessons for disclosure control can be learnt from the forensic identification literature. The main lesson that is considered is that disclosure risk assessment cannot, in general, ignore the search method that is employed by an intruder seeking to achieve disclosure. The effects of using several search methods are considered. Through consideration of the plausibility of assumptions and 'worst case' approaches, the paper suggests how the impact of search method can be handled. The paper focuses on foundations of disclosure risk assessment, providing some justification for some modelling assumptions underlying some existing record level measures of disclosure risk. The paper illustrates the effects of using various search methods in a numerical example based on microdata from a sample from the 2001 UK census.

7.
Before releasing survey data, statistical agencies usually perturb the original data to keep each survey unit's information confidential. One significant concern in releasing survey microdata is identity disclosure, which occurs when an intruder correctly identifies the records of a survey unit by matching the values of some key (or pseudo-identifying) variables. We examine a recently developed post-randomization method for strict control of identification risks in releasing survey microdata. While that procedure preserves the observed frequencies, and hence statistical estimates, well under simple random sampling, we show that in general surveys it may induce considerable bias in commonly used survey-weighted estimators. We propose a modified procedure that better preserves weighted estimates. The procedure is illustrated and empirically assessed with an application to a publicly available US Census Bureau data set.
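For readers unfamiliar with post-randomization, the sketch below shows the basic mechanism on a single categorical key variable: each observed category is replaced by a draw from the corresponding row of a transition matrix. This is only the generic building block; the procedure examined in the paper adds strict identification-risk control and, in the modified version, adjustments that better preserve survey-weighted estimates. The categories and the transition matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

def pram(values, categories, transition):
    """Basic post-randomization: each category is replaced by a draw from
    the corresponding row of a transition matrix (rows sum to 1)."""
    idx = {c: i for i, c in enumerate(categories)}
    out = [categories[rng.choice(len(categories), p=transition[idx[v]])] for v in values]
    return np.array(out)

categories = ["A", "B", "C"]
# keep the observed category with probability 0.9, otherwise switch
P = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

key = rng.choice(categories, size=1000, p=[0.5, 0.3, 0.2])
key_randomized = pram(key, categories, P)
```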

8.
Under given concrete exogenous conditions, the fraction of identifiable records in a microdata file without positive identifiers such as name and address is estimated. The effect of possible noise in the data, as well as the sample property of microdata files, is taken into account. Using real microdata files, it is shown that there is no risk of disclosure if the information content of characteristics known to the investigator (additional knowledge) is limited. Files with additional knowledge of large information content yield a high risk of disclosure. This can be eliminated only by massive modifications of the data records, which, however, involve large biases for complex statistical evaluations. In this case, the requirements for privacy protection and high-quality data may be fulfilled only if the linkage of such files with extensive additional knowledge is prevented by appropriate organizational and legal restrictions.

9.
Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, in order to achieve these conflicting objectives, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but also may increase disclosure risk. In this paper, we describe a method for SDL that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with the proper, data-dependent covariance structure.
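The simplified sketch below illustrates plain multiplicative noise masking with unit-mean lognormal noise: positivity and (in expectation) the means are preserved, but the covariance matrix is inflated. The authors' method goes further by giving the noise a data-dependent covariance structure so that the mean vector and covariance matrix of the original data are preserved as well; that construction is not reproduced here. Data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# hypothetical positive-valued confidential attributes (e.g. wages, turnover)
data = rng.lognormal(mean=3.0, sigma=0.4, size=(1000, 2))

# unit-mean multiplicative lognormal noise: keeps values positive and leaves
# the means unchanged in expectation, but inflates variances and covariances
sigma_noise = 0.1
noise = rng.lognormal(mean=-0.5 * sigma_noise**2, sigma=sigma_noise, size=data.shape)
masked = data * noise

print("original means:", data.mean(axis=0))
print("masked means:  ", masked.mean(axis=0))
print("original covariance:\n", np.cov(data, rowvar=False))
print("masked covariance:\n", np.cov(masked, rowvar=False))
```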

10.
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.
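As background for readers who have not met synthetic-data inference, the sketch below applies the standard combining rule for a scalar estimand computed on m partially synthetic datasets: the point estimate is the average of the m estimates, and its variance is the average within-dataset variance plus the between-dataset variance divided by m. The article itself goes further, deriving Wald and likelihood ratio test procedures for multi-component estimands, which need more than this scalar rule. Numbers below are hypothetical.

```python
import numpy as np

def combine_partially_synthetic(estimates, variances):
    """Combining rule for a scalar estimand over m partially synthetic
    datasets: q_bar = mean estimate, total variance = u_bar + b_m / m."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()        # combined point estimate
    u_bar = variances.mean()        # average within-dataset variance
    b_m = estimates.var(ddof=1)     # between-dataset variance
    return q_bar, u_bar + b_m / m

# hypothetical estimates of a regression coefficient from m = 5 synthetic datasets
q_hat = [1.02, 0.97, 1.05, 0.99, 1.01]
u_hat = [0.010, 0.011, 0.009, 0.010, 0.012]
q_bar, t_var = combine_partially_synthetic(q_hat, u_hat)
print(q_bar, t_var ** 0.5)
```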

11.
In this paper we discuss methodology for the safe release of business microdata. In particular we extend the model-based protection procedure of Franconi and Stander (2002, The Statistician 51: 1–11) by allowing the model to take account of the spatial structure underlying the geographical information in the microdata. We discuss the use of the Gibbs sampler for performing the computations required by this spatial approach. We provide an empirical comparison of these non-spatial and spatial disclosure limitation methods based on the Italian sample from the Community Innovation Survey. We quantify the level of protection achieved for the released microdata and the error induced when various inferences are performed. We find that although the spatial method often induces higher inferential errors, it almost always provides more protection. Moreover the aggregated areas from the spatial procedure can be somewhat more spatially smooth, and hence possibly more meaningful, than those from the non-spatial approach. We discuss possible applications of these model-based protection procedures to more spatially extensive data sets.

12.
Disseminating microdata to the public that provide a high level of data utility, while at the same time guaranteeing the confidentiality of the survey respondents, is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential of enabling the data disseminating agency to achieve this twofold goal. So far, the approach has been successfully implemented only for a limited number of datasets in the U.S. In this paper, we present the first successful implementation outside the U.S.: the generation of partially synthetic datasets for an establishment panel survey at the German Institute for Employment Research. We describe the whole evolution of the project: from the early discussions concerning variables at risk to the final synthesis. We also present our disclosure risk evaluations and provide some first results on the data utility of the generated datasets. A variance-inflated imputation model is introduced that incorporates additional variability in the model for records that are not sufficiently protected by the standard synthesis.

13.
Statistical disclosure control (SDC) is a balancing act between mandatory data protection and the understandable demand from researchers for access to original data. In this paper, a family of methods is defined to 'mask' sensitive variables before data files can be released. In the first step, the variable to be masked is 'cloned' (C). Then, the duplicated variable as a whole or just a part of it is 'suppressed' (S). The masking procedure's third step 'imputes' (I) data for these artificial missings. Then, the original variable can be deleted and its masked substitute has to serve as the basis for the analysis of data. The idea of this general 'CSI framework' is to open the wide field of imputation methods for SDC. The method applied in the I-step can make use of available auxiliary variables including the original variable. Different members of this family of methods delivering variance estimators are discussed in some detail. Furthermore, a simulation study analyzes various methods belonging to the family with respect to both the quality of parameter estimation and privacy protection. Based on the results obtained, recommendations are formulated for different estimation tasks.
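A minimal walk-through of the three CSI steps on hypothetical data is sketched below, using a simple normal linear imputation model on two auxiliary variables; the choice of suppression pattern and imputation model is exactly where members of the family differ, so this is just one plain instance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# hypothetical data: sensitive variable 'income' plus auxiliaries 'age', 'hours'
n = 400
df = pd.DataFrame({"age": rng.integers(20, 65, n), "hours": rng.integers(10, 50, n)})
df["income"] = 200 + 30 * df["age"] + 45 * df["hours"] + rng.normal(0, 500, n)

# C-step: clone the variable to be masked
df["income_masked"] = df["income"]

# S-step: suppress part of the clone (here a random 60 percent, one simple choice)
suppress = rng.random(n) < 0.6
df.loc[suppress, "income_masked"] = np.nan

# I-step: impute the artificial missings from a normal linear model on the
# auxiliaries (the I-step could also use the original variable itself)
obs = df["income_masked"].notna()
X_obs = np.column_stack([np.ones(obs.sum()), df.loc[obs, ["age", "hours"]]])
beta, *_ = np.linalg.lstsq(X_obs, df.loc[obs, "income_masked"], rcond=None)
sigma = (df.loc[obs, "income_masked"] - X_obs @ beta).std()
X_mis = np.column_stack([np.ones(suppress.sum()), df.loc[suppress, ["age", "hours"]]])
df.loc[suppress, "income_masked"] = X_mis @ beta + rng.normal(0, sigma, suppress.sum())

# the original column would then be deleted before release;
# analyses are run on the masked substitute
```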

14.
15.
The paper proposes a new disclosure limitation procedure based on simulation. The key feature of the proposal is to protect actual microdata by drawing artificial units from a probability model that is estimated from the observed data. Such a model is designed to maintain selected characteristics of the empirical distribution, thus providing a partial representation of the latter. The characteristics we focus on are the expected values of a set of functions; these are constrained to be equal to their corresponding sample averages; the simulated data, then, reproduce on average the sample characteristics. If the set of constraints covers the parameters of interest of a user, information loss is controlled for; at the same time, because the model does not preserve individual values, re-identification attempts are impaired: synthetic individuals correspond to actual respondents with very low probability. Disclosure is mainly discussed from the viewpoint of record re-identification. Under this definition, since the pledge of confidentiality involves only the actual respondents, the release of synthetic units should in principle rule out confidentiality concerns. The simulation model is built on the Italian sample from the Community Innovation Survey (CIS). The approach can be applied in more generality, and especially suits quantitative traits. The model has a semi-parametric component, based on the maximum entropy principle, and, here, a parametric component, based on regression. The maximum entropy principle is exploited to match data traits; moreover, entropy measures the uncertainty of a distribution: its maximisation leads to a distribution which is consistent with the given information but is maximally noncommittal with regard to missing information. Application results reveal that the fixed characteristics are sustained, and other features such as marginal distributions are well represented. Model specification is clearly a major point; related issues are selection of characteristics, goodness of fit and strength of dependence relations.
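In the simplest special case, which the sketch below illustrates, the constrained characteristics are the mean and variance of a single unbounded variable; the maximum entropy distribution satisfying those constraints is the normal distribution with exactly those moments, so synthetic units drawn from it reproduce the constrained characteristics on average while sharing no individual values with actual respondents. The paper's semi-parametric model handles a much richer set of constraints and adds a regression component, none of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(8)

# hypothetical observed variable (e.g. innovation expenditure)
observed = rng.gamma(shape=3.0, scale=2.0, size=500)

# constrained characteristics: sample mean and variance
m, v = observed.mean(), observed.var(ddof=1)

# with only these two moment constraints on an unbounded variable, the
# maximum entropy distribution is normal(m, v); synthetic units drawn from
# it match the constraints on average
synthetic = rng.normal(loc=m, scale=np.sqrt(v), size=500)

print("observed mean/variance: ", m.round(2), v.round(2))
print("synthetic mean/variance:", synthetic.mean().round(2), synthetic.var(ddof=1).round(2))
```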

16.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.

17.
Summary: One specific problem statistical offices and research institutes are faced with when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, and information loss is potentially high. In this paper an alternative technique of creating scientific-use files is discussed, which reproduces the characteristics of the original data quite well. It is based on Fienberg (1997, 1994) who estimates and resamples from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates data sets – the resample – which have the same characteristics as the original survey data. The paper includes some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and a comparison between resampling and a common method of disclosure control (disturbance with multiplicative error) with regard to confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be better reproduced by unweighted resampling. Parameter estimates can be reproduced quite well if the resampling procedure implements the correlation structure of the original data as a scale or if the data is multiplicatively perturbed and a correction term is used. On average, anonymization of data with multiplicatively perturbed values protects better against re-identification than the various resampling methods used.
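To see why resampling preserves the data's characteristics, the sketch below draws whole records with replacement, which amounts to sampling from the empirical multivariate distribution, and compares marginal summaries and correlations. This is only the simplest illustration: a plain row resample still contains original records, whereas the procedures compared in the paper work with an estimated distribution function, a correlation-structure variant, and multiplicative disturbance with a correction term, none of which is reproduced here. Data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)

# hypothetical survey data with two related numeric variables
n = 1000
turnover = rng.lognormal(mean=4.0, sigma=0.6, size=n)
rd_spend = 0.1 * turnover * rng.lognormal(mean=0.0, sigma=0.3, size=n)
original = pd.DataFrame({"turnover": turnover, "rd_spend": rd_spend})

# unweighted resampling from the empirical multivariate distribution
resample = original.sample(n=n, replace=True, random_state=10).reset_index(drop=True)

print(original.describe().round(1))
print(resample.describe().round(1))
print(original.corr().round(2))
print(resample.corr().round(2))
```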

18.
Summary. Microaggregation by individual ranking is one of the most commonly applied disclosure control techniques for continuous microdata. The paper studies the effect of microaggregation by individual ranking on the least squares estimation of a multiple linear regression model. It is shown that the traditional least squares estimates are asymptotically unbiased. Moreover, the least squares estimates asymptotically have the same variances as the least squares estimates based on the original (non-aggregated) data. Thus, asymptotically, microaggregation by individual ranking does not result in a loss of efficiency in the least squares estimation of a multiple linear regression model.
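A short sketch of the aggregation mechanism itself is given below (the paper's contribution is the asymptotic analysis of least squares on such data, not the algorithm): each variable is sorted separately, consecutive groups of k ordered values are replaced by their group mean, and the values are returned to their original positions. One could check the paper's asymptotic result empirically by comparing OLS estimates on original and microaggregated data. The group size k = 3 below is arbitrary.

```python
import numpy as np

def microaggregate_individual_ranking(X, k=3):
    """Microaggregation by individual ranking: sort each column separately,
    replace consecutive groups of k ordered values by their group mean,
    then restore the original row order."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        col = X[order, j]
        for start in range(0, n, k):
            grp = slice(start, min(start + k, n))   # last group may be smaller than k
            col[grp] = col[grp].mean()
        out[order, j] = col
    return out

rng = np.random.default_rng(11)
X = rng.normal(size=(12, 2))
X_agg = microaggregate_individual_ranking(X, k=3)
print(np.column_stack([X[:, 0], X_agg[:, 0]]).round(2))
```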

19.
The technique of data suppression for protecting sensitive information in a two-dimensional table from exact disclosure raises the computational problems of testing a given table of censored data for security, and searching for a secure suppression pattern of minimum size for a given table. We provide a polynomial security test to solve the former problem, and prove that the latter problem is intractable in the general case, but can be solved in linear time in the special case in which only sensitive cells are to be protected.
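For concreteness, the sketch below uses linear programming on a small hypothetical table to compute the range of values an attacker could deduce for a suppressed cell from the published cells and the row and column totals; if the range collapses to a single point, the cell is exactly disclosed. This brute-force LP check only illustrates the problem setting; the paper's contribution is a polynomial security test and the complexity results for finding minimum suppression patterns.

```python
import numpy as np
from scipy.optimize import linprog

def feasibility_interval(table, suppressed, target):
    """Range of values consistent with the published cells and the row and
    column totals for one suppressed cell; a degenerate range means the
    suppressed value is exactly disclosed."""
    rows, cols = table.shape
    cells = sorted(suppressed)                       # unknowns, in a fixed order
    idx = {c: i for i, c in enumerate(cells)}
    A_eq, b_eq = [], []
    for r in range(rows):                            # row-total constraints
        coef = np.zeros(len(cells))
        for c in range(cols):
            if (r, c) in suppressed:
                coef[idx[(r, c)]] = 1.0
        if coef.any():
            known = sum(table[r, c] for c in range(cols) if (r, c) not in suppressed)
            A_eq.append(coef)
            b_eq.append(table[r].sum() - known)
    for c in range(cols):                            # column-total constraints
        coef = np.zeros(len(cells))
        for r in range(rows):
            if (r, c) in suppressed:
                coef[idx[(r, c)]] = 1.0
        if coef.any():
            known = sum(table[r, c] for r in range(rows) if (r, c) not in suppressed)
            A_eq.append(coef)
            b_eq.append(table[:, c].sum() - known)
    obj = np.zeros(len(cells))
    obj[idx[target]] = 1.0
    lo = linprog(obj, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None)).fun
    hi = -linprog(-obj, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None)).fun
    return lo, hi

table = np.array([[5.0, 7.0, 3.0],
                  [4.0, 6.0, 8.0],
                  [2.0, 9.0, 1.0]])
# a 'diagonal' suppression pattern leaves the cell exactly determined ...
print(feasibility_interval(table, {(0, 0), (1, 1)}, (0, 0)))
# ... while a rectangular pattern yields a genuine range of feasible values
print(feasibility_interval(table, {(0, 0), (0, 1), (1, 0), (1, 1)}, (0, 0)))
```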

20.
"The census of population represents a rich source of social data. Other countries have released samples of anonymized records from their censuses to the research community for secondary analysis. So far this has not been done in Britain. The areas of research which might be expected to benefit from such microdata are outlined, and support is drawn from considering experience overseas. However, it is essential to protect the confidentiality of the data. The paper therefore considers the risks, both real and perceived, of identification of individuals from census microdata. The conclusion of the paper is that the potential benefits from census microdata are large and that the risks in terms of disclosure are very small. The authors therefore argue that the Office of Population Censuses and Surveys and the General Register Office of Scotland should release samples of anonymized records from the 1991 census for secondary analysis."  相似文献   
