Similar articles (20 results)
1.
"The census of population represents a rich source of social data. Other countries have released samples of anonymized records from their censuses to the research community for secondary analysis. So far this has not been done in Britain. The areas of research which might be expected to benefit from such microdata are outlined, and support is drawn from considering experience overseas. However, it is essential to protect the confidentiality of the data. The paper therefore considers the risks, both real and perceived, of identification of individuals from census microdata. The conclusion of the paper is that the potential benefits from census microdata are large and that the risks in terms of disclosure are very small. The authors therefore argue that the Office of Population Censuses and Surveys and the General Register Office of Scotland should release samples of anonymized records from the 1991 census for secondary analysis."  相似文献   

2.
In 1991 Marsh and co-workers made the case for a sample of anonymized records (SAR) from the 1991 census of population. The case was accepted by the Office for National Statistics (then the Office of Population Censuses and Surveys) and a request was made by the Economic and Social Research Council to purchase the SARs. Two files were released for Great Britain—a 2% sample of individuals and a 1% sample of households. Subsequently, similar samples were released for Northern Ireland. Since their release, the files have been heavily used for research, and there has been no known breach of confidentiality. There is considerable demand for similar files from the 2001 census, with specific requests for a larger sample size and a lower population threshold for the individual SAR. This paper reassesses Marsh and co-workers' analysis of the risk of identification of an individual or household in a sample of microdata from the 1991 census, and also uses alternative ways of assessing risk with the 1991 SARs. The results of both the reassessment and the new analyses are reassuring and allow us to take the 1991 SARs as a baseline against which to assess proposals for changes to the size and structure of samples from the 2001 census.

3.
When tables are generated from a data file, their release should not reveal overly detailed information about individual respondents. Disclosure of individual respondents in the microdata file can be prevented by applying disclosure control methods at the table level (cell suppression or cell perturbation), but this may create inconsistencies among other tables based on the same data file. Alternatively, disclosure control methods can be applied at the microdata level, but these methods may change the data permanently and do not account for specific table properties. These problems can be circumvented by assigning a single, fixed weight factor to each respondent/record in the microdata file. Normally this weight factor equals 1 for each record and is not explicitly incorporated in the microdata file. Upon tabulation, each contribution of a respondent is weighted multiplicatively by the respondent's weight factor. This approach is called Source Data Perturbation (SDP) because the data are perturbed at the microdata level, not at the table level. Note, however, that the original microdata are not changed; only a weight variable is added. The weight factors can be chosen in accordance with the SDC paradigm, i.e. such that the tables generated from the microdata are safe and the information loss is minimized. The paper indicates how this can be done. Moreover, it is shown that the SDP approach is very suitable for use in data warehouses, as the weights can conveniently be put in the fact tables. The data can then still be accessed, sliced, and diced up to a certain level of detail, and tables generated from the data warehouse are mutually consistent and safe.
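As a rough illustration of the weighted-tabulation idea, the sketch below tabulates a toy microdata file whose records carry an SDP weight column; the variable names, weights, and data are hypothetical, not taken from the paper.

```python
import pandas as pd

# Toy microdata file. Normally every sdp_weight is 1; here two records carry
# non-unit weights chosen (by some SDC criterion) to make the tables safe.
microdata = pd.DataFrame({
    "region":     ["North", "North", "South", "South", "South"],
    "industry":   ["A", "B", "A", "A", "B"],
    "turnover":   [100.0, 250.0, 80.0, 120.0, 300.0],
    "sdp_weight": [1.0, 1.0, 0.9, 1.1, 1.0],
})

# Every table weights each respondent's contribution multiplicatively, so all
# tables generated from the same weighted file are mutually consistent.
table = (
    microdata
    .assign(weighted=microdata["turnover"] * microdata["sdp_weight"])
    .pivot_table(index="region", columns="industry",
                 values="weighted", aggfunc="sum", fill_value=0.0)
)
print(table)
```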

4.
5.
National statistical agencies and other data custodians collect and hold a vast amount of survey and census data, containing information vital for research and policy analysis. However, the problem of allowing analysis of these data, while protecting respondent confidentiality, has proved challenging to address. In this paper we focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, then returns the results to the analyst. In particular, the analyst does not have direct access to the data at all and cannot view any microdata records. We further focus on the fitting of linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional confidentiality protection for outliers and influential points in the data. The method we describe in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, it would be suitable for similar remote analysis systems.
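The following sketch is not the DataAnalyser method itself (the paper's outlier protection is not reproduced here); it only illustrates the general remote-analysis pattern, in which a query runs inside the secure environment and only aggregate output is returned.

```python
import numpy as np

def remote_ols(X, y):
    """Runs inside the secure environment; the analyst receives only
    aggregate regression output, never the microdata rows themselves."""
    X1 = np.column_stack([np.ones(len(X)), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit
    resid = y - X1 @ beta
    dof = len(y) - X1.shape[1]
    return {"coefficients": beta.tolist(),
            "residual_variance": float(resid @ resid / dof),
            "n": len(y)}

# The analyst submits a query and sees only the returned aggregates.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(size=100)
print(remote_ols(X, y))
```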

6.
Under given concrete exogenous conditions, the fraction of identifiable records in a microdata file without positive identifiers such as name and address is estimated. The effect of possible noise in the data, as well as the sampling properties of microdata files, is taken into account. Using real microdata files, it is shown that there is no risk of disclosure if the information content of the characteristics known to the investigator (additional knowledge) is limited. Files with additional knowledge of large information content yield a high risk of disclosure. This risk can be eliminated only by massive modification of the data records, which, however, introduces large biases into complex statistical evaluations. In this case, the requirements for privacy protection and high-quality data may be fulfilled only if the linkage of such files with extensive additional knowledge is prevented by appropriate organizational and legal restrictions.

7.
Before releasing survey data, statistical agencies usually perturb the original data to keep each survey unit's information confidential. One significant concern in releasing survey microdata is identity disclosure, which occurs when an intruder correctly identifies the records of a survey unit by matching the values of some key (or pseudo-identifying) variables. We examine a recently developed post-randomization method for strict control of identification risks in releasing survey microdata. While that procedure preserves the observed frequencies, and hence statistical estimates, well in the case of simple random sampling, we show that in general surveys it may induce considerable bias in commonly used survey-weighted estimators. We propose a modified procedure that better preserves weighted estimates. The procedure is illustrated and empirically assessed with an application to a publicly available US Census Bureau data set.
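A generic post-randomisation (PRAM-style) sketch under simple random sampling; it is not the specific procedure the paper examines, and the categories and transition matrix below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

categories = ["A", "B", "C"]
# Row i gives the release probabilities for a record whose true category is i.
transition = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
])

def pram(values, categories, transition, rng):
    """Perturb a vector of categorical key values via the transition matrix."""
    index = {c: i for i, c in enumerate(categories)}
    return [
        categories[rng.choice(len(categories), p=transition[index[v]])]
        for v in values
    ]

key_variable = ["A", "A", "B", "C", "B", "A"]
released = pram(key_variable, categories, transition, rng)
print(released)
```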

8.
Disseminating microdata to the public that provide a high level of data utility, while at the same time guaranteeing the confidentiality of the survey respondents, is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential to enable the data disseminating agency to achieve this twofold goal. So far, the approach has been implemented successfully only for a limited number of datasets in the U.S. In this paper, we present the first successful implementation outside the U.S.: the generation of partially synthetic datasets for an establishment panel survey at the German Institute for Employment Research. We describe the whole evolution of the project, from the early discussions concerning variables at risk to the final synthesis. We also present our disclosure risk evaluations and provide some first results on the data utility of the generated datasets. A variance-inflated imputation model is introduced that incorporates additional variability in the model for records that are not sufficiently protected by the standard synthesis.
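A hedged sketch of the variance-inflation idea: synthetic values for records flagged as insufficiently protected are drawn with inflated predictive variance. The IAB synthesis model is far more elaborate; the imputation model, risk flag, and inflation factor here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)

# Fit a simple imputation model on the confidential data.
b1, b0 = np.polyfit(x, y, 1)                # slope, intercept
sigma = np.std(y - (b0 + b1 * x), ddof=2)   # residual standard deviation

at_risk = np.abs(x) > 2.0                   # e.g. extreme records judged risky
inflation = np.where(at_risk, 2.0, 1.0)     # variance inflated for risky records

# Partially synthetic y: every value replaced by a draw from the model,
# with extra protective noise where the standard synthesis is not enough.
y_synthetic = b0 + b1 * x + rng.normal(scale=sigma * np.sqrt(inflation), size=n)
print(y_synthetic[:5])
```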

9.
Small area statistics obtained from sample survey data provide a critical source of information used to study health, economic, and sociological trends. However, most large-scale sample surveys are not designed for the purpose of producing small area statistics. Moreover, data disseminators are prevented from releasing public-use microdata for small geographic areas for disclosure reasons, thus limiting the utility of the data they collect. This research evaluates a synthetic data method, intended for data disseminators, for releasing public-use microdata for small geographic areas based on complex sample survey data. The method replaces all observed survey values with synthetic (or imputed) values generated from a hierarchical Bayesian model that explicitly accounts for complex sample design features, including stratification, clustering, and sampling weights. The method is applied to restricted microdata from the National Health Interview Survey, and synthetic data are generated for both sampled and non-sampled small areas. The analytic validity of the resulting small area inferences is assessed by direct comparison with the actual data, a simulation study, and a cross-validation study.

10.
Many of the available methods for estimating small-area parameters are model-based approaches in which auxiliary variables are used to predict the variable of interest. For models that are nonlinear, prediction is not straightforward. MacGibbon and Tomberlin, and Farrell, MacGibbon, and Tomberlin, have proposed approaches that require microdata for all individuals in a small area. In this article, we develop a method, based on a second-order Taylor-series expansion, for obtaining model-based predictions that requires only local-area summary statistics for both continuous and categorical auxiliary variables. The methodology is evaluated using data based on a U.S. Census.
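The exact expansion used by the authors is not reproduced here; the sketch below only illustrates the mechanics of a second-order Taylor approximation for a logistic model with a single continuous auxiliary variable, using local-area summary statistics alone.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def taylor_mean_proportion(beta0, beta1, x_mean, x_var):
    """Approximate E[expit(beta0 + beta1 * x)] from only the local-area
    mean and variance of x (second-order Taylor expansion about x_mean)."""
    eta = beta0 + beta1 * x_mean
    p = expit(eta)
    # Second derivative of expit(eta) with respect to eta: p(1-p)(1-2p).
    second = p * (1.0 - p) * (1.0 - 2.0 * p)
    return p + 0.5 * second * (beta1 ** 2) * x_var

# Prediction from area-level summary statistics (no microdata needed).
print(taylor_mean_proportion(beta0=-1.2, beta1=0.8, x_mean=0.5, x_var=2.0))
```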

11.
The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated with the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set whose records are to be linked to those in the protected data set; the percentage of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.
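A minimal sketch of the conventional, shared-variables case that the paper reviews: each external record is linked to its nearest protected record, and the proportion of correct links measures disclosure risk. The data and the noise-addition masking are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
original = rng.normal(size=(n, 3))             # external data known to the intruder
protected = original + rng.normal(scale=0.5,   # masked release (noise addition)
                                  size=(n, 3))

# Nearest-neighbour linkage on Euclidean distance over the shared variables.
dists = np.linalg.norm(original[:, None, :] - protected[None, :, :], axis=2)
linked = dists.argmin(axis=1)

# Records are in the same order in both files, so a correct link is i -> i.
risk = (linked == np.arange(n)).mean()
print(f"proportion of correctly linked pairs (disclosure risk): {risk:.2%}")
```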

12.
Ecological studies are based on characteristics of groups of individuals and are common in various disciplines, including epidemiology. It is of great interest for epidemiologists to study the geographical variation of a disease by accounting for positive spatial dependence between neighbouring areas. However, the choice of scale for the spatial correlation requires much attention. In view of a lack of studies in this area, this study aims to investigate the impact of differing definitions of geographical scale using a multilevel model. We propose a new approach, grid-based partitioning, and compare it with the popular census region approach. Unexplained geographical variation is accounted for via area-specific unstructured random effects and spatially structured random effects specified as an intrinsic conditional autoregressive process. Using grid-based modelling of the random effects, in contrast to the census region approach, we illustrate conditions under which improvements are observed in the estimation of the linear predictor, random effects, and parameters, and in the identification of the distribution of residual risk and aggregate risk in a study region. The study found that grid-based modelling is a valuable approach for spatially sparse data, while the statistical-local-area-based and grid-based approaches perform equally well for spatially dense data.
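A small sketch of the grid-based partition itself (fitting the multilevel ICAR model is not shown): points are binned into regular grid cells, and a rook neighbourhood over the cells supplies the adjacency structure an intrinsic CAR prior needs. The region size and cell size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.uniform(0, 10, size=(500, 2))  # event locations in a 10 x 10 region

cell_size = 2.0
n_side = int(10 / cell_size)                # 5 x 5 grid
cells = np.floor(points / cell_size).astype(int)

# Counts per grid cell: the areal data for the ecological analysis.
counts = np.zeros((n_side, n_side), dtype=int)
for col, row in cells:
    counts[row, col] += 1

# Rook (first-order) adjacency between cells, as required by an intrinsic
# conditional autoregressive (ICAR) prior on the spatial random effects.
def neighbours(row, col):
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= row + dr < n_side and 0 <= col + dc < n_side:
            yield (row + dr, col + dc)

print(counts)
print(list(neighbours(0, 0)))
```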

13.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.

14.
Multivariate shrinkage estimation of small area means and proportions
The familiar (univariate) shrinkage estimator of a small area mean or proportion combines information from the small area and a national survey. We define a multivariate shrinkage estimator which also combines information across subpopulations and outcome variables. The superiority of multivariate shrinkage over univariate shrinkage, and of univariate shrinkage over the unbiased (sample) means, is illustrated with examples estimating local-area rates of economic activity in subpopulations defined by ethnicity, age, and sex. The examples use the sample of anonymized records of individuals from the 1991 UK census. The method requires no distributional assumptions but relies on the appropriateness of the quadratic loss function. The implementation of the method involves a minimal outlay of computing. Multivariate shrinkage is particularly effective when the area-level means are highly correlated and the sample means of one or a few components have small sampling and between-area variances. Estimates for subpopulations based on small samples can be greatly improved by incorporating information from subpopulations with larger sample sizes.
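A sketch of the familiar univariate composite estimator that multivariate shrinkage generalises; the weighting below is the standard precision-based form, not the authors' exact multivariate estimator, and the numbers are hypothetical.

```python
def shrinkage_estimate(area_mean, area_sampling_var, national_mean, between_area_var):
    """Combine a small-area sample mean with the national mean."""
    # The weight on the area's own data grows as its sampling variance shrinks.
    w = between_area_var / (between_area_var + area_sampling_var)
    return w * area_mean + (1.0 - w) * national_mean

# A small area with few observations is shrunk strongly towards the
# national rate of economic activity.
print(shrinkage_estimate(area_mean=0.42, area_sampling_var=0.010,
                         national_mean=0.55, between_area_var=0.002))
```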

15.
Summary. Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure has at least two advantages. First, we suggest that it may be a more realistic measure of risk than two measures that are currently used with census data. Second, we show that consistent inference (in a specified sense) may be made about this measure from sample data without strong modelling assumptions. This is a surprising finding, given its contrast with the properties of the two 'similar' established measures. As a result, this measure has potentially useful applications to sample surveys. In addition to obtaining a simple consistent predictor of the measure, we propose a simple variance estimator and show that it is consistent. We also consider the extension of inference to allow for certain complex sampling schemes. We present a numerical study based on 1991 census data for about 450 000 enumerated individuals in one area of Great Britain. We show that the theoretical results on the properties of the point predictor of the measure of risk and its variance estimator hold to a good approximation for these data.
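A hedged sketch of the measure itself, computed directly from a toy population for illustration; the paper's contribution is showing that the measure can be inferred consistently from the sample alone, which is not reproduced here.

```python
from collections import Counter
import random

random.seed(1)

# Toy population: each unit is a combination of key variables (sex, age, region).
population = [(random.choice("MF"), random.randrange(100), random.randrange(10))
              for _ in range(1000)]
sample = random.sample(population, 100)   # the released microdata

pop_counts = Counter(population)
sample_counts = Counter(sample)

# Sample uniques: key combinations appearing exactly once in the microdata.
uniques = [key for key, f in sample_counts.items() if f == 1]

# A unique match of such a record to a population unit sharing its key is
# correct with probability 1/F, where F is the population frequency of the key.
theta = sum(1.0 / pop_counts[key] for key in uniques) / len(uniques)
print(f"P(unique match is correct) = {theta:.3f}")
```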

16.
Bayesian networks for imputation
Summary. Bayesian networks are particularly useful for dealing with high-dimensional statistical problems. They allow a reduction in the complexity of the phenomenon under study by representing joint relationships between a set of variables through conditional relationships between subsets of these variables. Following Thibaudeau and Winkler, we use Bayesian networks for imputing missing values. This method is introduced to deal with the problem of the consistency of imputed values: preservation of statistical relationships between variables (statistical consistency) and preservation of logical constraints in data (logical consistency). We perform some experiments on a subset of anonymous individual records from the 1991 UK population census.
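A minimal sketch of the consistency idea with a two-node network; the census applications in the paper use much larger networks over many variables, and the variables and data here are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy two-node network: tenure -> cars, i.e. P(tenure) * P(cars | tenure).
data = pd.DataFrame({
    "tenure": ["own", "own", "rent", "rent", "own", "rent", "own", "rent"],
    "cars":   ["2+", "1", "0", "1", "2+", "0", "1", None],  # last value missing
})

complete = data.dropna()

# Conditional distribution P(cars | tenure) estimated from complete records;
# drawing from it preserves the statistical relationship between the variables,
# and only category combinations observed in the data can be imputed.
cond = complete.groupby("tenure")["cars"].value_counts(normalize=True)

rng = np.random.default_rng(7)

def impute_cars(row):
    dist = cond[row["tenure"]]
    return rng.choice(dist.index.to_list(), p=dist.to_numpy())

missing = data["cars"].isna()
data.loc[missing, "cars"] = data.loc[missing].apply(impute_cars, axis=1)
print(data)
```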

17.
In many socio-economic surveys the objective is the estimation of the total or proportion of persons with a particular attribute. Multi-stage area samples are drawn from geographic strata, and the population within areal units is used as an auxiliary variable in ratio estimation. For large administrative areas, the auxiliary variable totals are available as population projections based on the last census. For small areas, however, population changes are significantly affected by non-demographic factors, and hence sufficiently reliable projections are not available. In such situations the efficiency of design-based estimators for small areas can be improved by a ratio adjustment based on the auxiliary variable total for a large area. An inequality on the efficiency of the ratio-adjusted estimator is established, and its bias and variance are investigated.
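A sketch of the basic arithmetic of such a ratio adjustment: direct small-area estimates are rescaled so that the implied large-area auxiliary total matches its known projected value. The exact estimator studied in the paper may differ in detail, and the figures are hypothetical.

```python
# Direct (design-based) small-area estimates of the auxiliary variable
# (e.g. population) and of the study variable (e.g. persons with an attribute).
x_hat = {"area1": 9_500, "area2": 14_200, "area3": 6_800}
y_hat = {"area1": 1_900, "area2": 3_100, "area3": 1_300}

# A reliable auxiliary total (census-based projection) is available only
# for the large area containing the three small areas.
x_total_large_area = 32_000

# Rescale each direct estimate by the ratio of the known large-area total
# to its sample-based counterpart.
adjustment = x_total_large_area / sum(x_hat.values())
y_adjusted = {area: y * adjustment for area, y in y_hat.items()}
print(y_adjusted)
```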

18.
Summary. Statistical methods of ecological analysis that attempt to reduce ecological bias are empirically evaluated to determine in which circumstances each method might be practicable. The method that is most successful at reducing ecological bias is stratified ecological regression. It allows individual-level covariate information to be incorporated into a stratified ecological analysis, as well as the combination of disease and risk factor information from two separate data sources, e.g. outcomes from a cancer registry and risk factor information from the census sample of anonymized records data set. The aggregated individual-level model compares favourably with this model but has convergence problems. In addition, it is shown that the large areas covered by local authority districts seem to reduce between-area variability and may therefore not be as informative as conducting a ward-level analysis. This has policy implications because access to ward-level data is restricted.

19.
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.
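For context, a sketch of standard combining rules for partially synthetic data and a Wald-type statistic built from them; the paper's contribution is the appropriate reference distributions and likelihood ratio analogues for multi-component estimands, which are not derived here.

```python
import numpy as np

def combine_partially_synthetic(estimates, covariances):
    """estimates: m point estimates (k-vectors), one per synthetic dataset;
    covariances: the m corresponding k x k covariance estimates."""
    m = len(estimates)
    q_bar = np.mean(estimates, axis=0)        # combined point estimate
    u_bar = np.mean(covariances, axis=0)      # average within-dataset variance
    devs = np.asarray(estimates) - q_bar
    b = devs.T @ devs / (m - 1)               # between-dataset variance
    t = u_bar + b / m                         # total variance
    return q_bar, t

def wald_statistic(q_bar, t, theta0):
    """Wald-type statistic for H0: theta = theta0 (the paper derives the
    appropriate reference distribution for testing)."""
    d = q_bar - theta0
    return float(d @ np.linalg.solve(t, d))

estimates = [np.array([1.1, 0.4]), np.array([0.9, 0.5]), np.array([1.0, 0.45])]
covariances = [0.01 * np.eye(2)] * 3
q, t = combine_partially_synthetic(estimates, covariances)
print(wald_statistic(q, t, theta0=np.array([1.0, 0.5])))
```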

20.
The post-enumeration quality survey of the China agricultural census
1. The PES data analysis model. The post-enumeration survey (PES) is not merely a repetition of the census; it is a method of collecting "true values" for a survey that has already been carried out. In China, PES household interviews are conducted under the same conditions as the census, with the aim of obtaining results as close to the truth as possible. The data obtained from the PES are compared with the census data. Comparing the responses from the two surveys item by item yields the following outcomes: (1) the census and PES responses agree; (2) the responses from the two surveys disagree: the larger the difference between the responses, the lower the reliability of the census estimates. The purpose of the PES is to check and assess the quality of the census data. Research on PES response error shows that response error also affects the PES data. In what follows, we treat the two surveys on an equal footing. The indicator for estimating reliability…
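A minimal sketch of the item-by-item comparison described above, with hypothetical unit IDs and responses:

```python
# Census and PES responses for the same question, keyed by household unit ID.
census = {101: "maize", 102: "rice", 103: "wheat", 104: "rice"}
pes    = {101: "maize", 102: "rice", 103: "rice",  104: "rice"}

# Compare the two surveys' responses unit by unit.
matched = [unit for unit in pes if unit in census]
agree = sum(census[u] == pes[u] for u in matched)
agreement_rate = agree / len(matched)

# A lower agreement rate implies lower reliability of the census estimates.
print(f"census/PES agreement: {agreement_rate:.0%}")
```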
