Similar documents (20 found)
1.
In 1991 Marsh and co-workers made the case for a sample of anonymized records (SAR) from the 1991 census of population. The case was accepted by the Office for National Statistics (then the Office of Population Censuses and Surveys) and a request was made by the Economic and Social Research Council to purchase the SARs. Two files were released for Great Britain—a 2% sample of individuals and a 1% sample of households. Subsequently similar samples were released for Northern Ireland. Since their release, the files have been heavily used for research and there has been no known breach of confidentiality. There is a considerable demand for similar files from the 2001 census, with specific requests for a larger sample size and lower population threshold for the individual SAR. This paper reassesses the analysis of Marsh and co-workers of the risk of identification of an individual or household in a sample of microdata from the 1991 census and also uses alternative ways of assessing risks with the 1991 SARs. The results of both the reassessment and the new analyses are reassuring and allow us to take the 1991 SARs as a base-line against which to assess proposals for changes to the size and structure of samples from the 2001 census.

2.
The case for small area microdata (total citations: 3; self-citations: 2; by others: 1)
Summary.  Census data are available in aggregate form for local areas and, through the samples of anonymized records (SARs), as samples of microdata for households and individuals. In 1991 there were two SAR files: a household file and an individual file. These have a high degree of detail on the census variables but little geographical detail, a situation that will be exacerbated for the 2001 SAR owing to the loss of district level geography on the individual SAR. The paper puts forward the case for an additional sample of microdata, also drawn from the census, that has much greater geographical detail. Small area microdata (SAM) are individual level records with local area identifiers and, to maintain confidentiality, reduced detail on the census variables. Population data from seven local authorities, including rural and urban areas, are used to define prototype samples of SAM. The rationale for SAM is given, with examples that demonstrate the role of local area information in the analysis of census data. Since there is a trade-off between the extent of local detail and the extent of detail on variables that can be made available, the confidentiality risk of SAM is assessed empirically. An indicative specification of the SAM is given, having taken into account the results of the confidentiality analysis.

3.
Summary.  Statistical methods of ecological analysis that attempt to reduce ecological bias are empirically evaluated to determine in which circumstances each method might be practicable. The method that is most successful at reducing ecological bias is stratified ecological regression. It allows individual level covariate information to be incorporated into a stratified ecological analysis, as well as the combination of disease and risk factor information from two separate data sources, e.g. outcomes from a cancer registry and risk factor information from the census sample of anonymized records data set. The aggregated individual level model compares favourably with this model but has convergence problems. In addition, it is shown that the large areas that are covered by local authority districts seem to reduce between-area variability and may therefore not be as informative as conducting a ward level analysis. This has policy implications because access to ward level data is restricted.

4.
Empirical research cannot proceed without data. Official aggregate data have increasingly become a public good that research communities and the general public can obtain through many channels. However, owing to technical, economic, legal and even political constraints, channels for sharing and disseminating statistical microdata are lacking, forcing research groups and individuals to collect data themselves, which leads to a great deal of duplicated effort and wasted money and time. At the same time, existing statistical microdata are under-exploited, which lowers the return on data collection and severely constrains the improvement of statistical capacity. This paper compares the current state of microdata release in China and abroad, discusses the benefits and risks of releasing microdata, and argues that the central problem is the tension between the growing demand for data and the risk of statistical disclosure. It then reviews the disclosure risk control methods commonly used internationally and, in the light of China's circumstances, offers recommendations for microdata release in China.

5.
National statistical agencies and other data custodians collect and hold a vast amount of survey and census data, containing information vital for research and policy analysis. However, the problem of allowing analysis of these data, while protecting respondent confidentiality, has proved challenging to address. In this paper we will focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, then returns the results to the analyst. In particular, the analyst does not have direct access to the data at all, and cannot view any microdata records. We further focus on the fitting of linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional confidentiality protection for outliers and influential points in the data. The method we describe in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, the method would be suitable for similar remote analysis systems.
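A minimal Python sketch of the remote-analysis idea described above: the confidential data live only inside a server-side object, and a regression query returns coefficients and an error variance rather than any microdata. The class name, the data and the set of released outputs are invented for illustration; the additional protections for outliers and influential points proposed in the paper (and built into the ABS DataAnalyser prototype) are not reproduced here.

import numpy as np

class RemoteRegressionServer:
    """Holds confidential data in the secure environment; answers queries with aggregates only."""

    def __init__(self, X, y):
        self._X = np.column_stack([np.ones(len(y)), X])   # add intercept
        self._y = np.asarray(y, dtype=float)

    def fit_query(self):
        beta, _, _, _ = np.linalg.lstsq(self._X, self._y, rcond=None)
        resid = self._y - self._X @ beta
        n, p = self._X.shape
        sigma2 = float(resid @ resid / (n - p))
        # Only aggregate output is released; residuals, leverages and records are withheld.
        return {"coef": beta.tolist(), "sigma2": sigma2, "n": int(n)}

# Analyst side: submit the query, receive aggregates only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -0.5]) + rng.normal(size=200)
print(RemoteRegressionServer(X, y).fit_query())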

6.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.
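To make the notion of edit rules concrete, the following Python sketch checks two invented rules of the kinds mentioned above, a balance equation and a linear inequality, against hypothetical establishment records. The field names and rules are illustrative assumptions, not those of the Census of Manufactures.

# Hypothetical establishment records; all field names and values are invented.
records = [
    {"total_emp": 120, "prod_emp": 90, "nonprod_emp": 30, "payroll": 4800, "sales": 20000},
    {"total_emp": 50, "prod_emp": 45, "nonprod_emp": 10, "payroll": 1900, "sales": 300},
]

def edit_failures(r):
    failures = []
    if r["prod_emp"] + r["nonprod_emp"] != r["total_emp"]:   # balance equation
        failures.append("employment components do not sum to the total")
    if r["sales"] < r["payroll"]:                            # linear inequality
        failures.append("reported sales are below reported payroll")
    return failures

for i, r in enumerate(records):
    print(f"record {i}:", edit_failures(r) or "passes all edits")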

7.
8.
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.  相似文献   
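As background to the inferential problem, the sketch below applies the widely used combining rule for a scalar estimand computed from m partially synthetic data sets; the article's contribution concerns multi-component estimands and the corresponding Wald and likelihood ratio tests, which this simple rule does not cover. The numbers are invented.

import numpy as np

def combine_partially_synthetic(q, u):
    """q[i], u[i]: point estimate and its variance from synthetic data set i."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    q_bar = q.mean()            # combined point estimate
    b_m = q.var(ddof=1)         # between-imputation variance
    u_bar = u.mean()            # average within-imputation variance
    return q_bar, u_bar + b_m / m   # point estimate and its total variance

q_bar, T = combine_partially_synthetic(q=[10.2, 9.8, 10.5], u=[0.40, 0.38, 0.42])
print(q_bar, T)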

9.
Before releasing survey data, statistical agencies usually perturb the original data to keep each survey unit's information confidential. One significant concern in releasing survey microdata is identity disclosure, which occurs when an intruder correctly identifies the records of a survey unit by matching the values of some key (or pseudo-identifying) variables. We examine a recently developed post-randomization method for a strict control of identification risks in releasing survey microdata. While that procedure preserves the observed frequencies, and hence statistical estimates, well under simple random sampling, we show that in general surveys it may induce considerable bias in commonly used survey-weighted estimators. We propose a modified procedure that better preserves weighted estimates. The procedure is illustrated and empirically assessed with an application to a publicly available US Census Bureau data set.
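A sketch of the basic post-randomization idea for a single categorical key variable: each observed category is replaced by another according to a transition probability matrix. The matrix and data are invented; the procedures discussed in the paper additionally calibrate the mechanism so that identification risk is strictly controlled and, in the modified version, survey-weighted estimates are preserved.

import numpy as np

rng = np.random.default_rng(42)
P = np.array([[0.90, 0.05, 0.05],    # row i gives the probabilities of releasing
              [0.05, 0.90, 0.05],    # each category for a record observed in category i
              [0.05, 0.05, 0.90]])

observed = rng.choice(3, size=1000, p=[0.5, 0.3, 0.2])            # original coded key variable
perturbed = np.array([rng.choice(3, p=P[i]) for i in observed])   # post-randomized release

print("original counts :", np.bincount(observed, minlength=3))
print("perturbed counts:", np.bincount(perturbed, minlength=3))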

10.
The value of using exact rather than asymptotic tests to measure intermarriage in the United Kingdom is examined. We develop Markov chain Monte Carlo methods for estimating the exact conditional p-value and the exact distribution of the residuals, for quasi-independence and quasi-symmetry. These methods are used to analyse a sparse 10 × 10 symmetric table of interethnic unions, extracted from the 1% household sample of anonymized records from the 1991 UK census. With the exception of Pakistani/White and Other Asian/White unions, there is no evidence against quasi-symmetry. We conclude that, with these exceptions, there are no gender differences in the affinities between ethnic groups.

11.
The public use sample from the 1991 UK census makes it possible to conduct individual level analyses of ethnic minorities' educational and occupational attainments. Unfortunately, however, the census asked only about higher level qualifications obtained after reaching 18 years of age. A comparison with the Labour Force Surveys (LFSs) shows that the census gives in some respects a misleading impression of qualifications among the first-generation members of ethnic minorities: the LFS data show that ethnic minorities tend to be more polarized in their qualifications than the British-born whites, with relatively large proportions at the two extremes, either with degrees or with no qualifications at all. It follows that the census's treatment of qualifications may tend to exaggerate the scale of disadvantage of ethnic minorities in the labour market, and particularly in access to the salariat, where qualifications play a particularly large role in recruitment. Regression analyses of the sample of anonymized records and LFS data confirm these expectations, although they indicate that the results of the census are not seriously misleading as regards the pattern of ethnic disadvantages in the competition to avoid unemployment. The LFS data also confirm earlier findings that the ethnic penalties among the second generation are in general of similar magnitude to those among the first generation, despite the substantial equalization of educational experience that has taken place. There is some evidence that disadvantages in access to the salariat may have been reduced, but this is counterbalanced by evidence that disadvantages in the avoidance of unemployment may have worsened.

12.
Multivariate shrinkage estimation of small area means and proportions (total citations: 3; self-citations: 0; by others: 3)
The familiar (univariate) shrinkage estimator of a small area mean or proportion combines information from the small area and a national survey. We define a multivariate shrinkage estimator which combines information also across subpopulations and outcome variables. The superiority of the multivariate shrinkage over univariate shrinkage, and of the univariate shrinkage over the unbiased (sample) means, is illustrated with examples of estimating the local area rates of economic activity in the subpopulations defined by ethnicity, age and sex. The examples use the sample of anonymized records of individuals from the 1991 UK census. The method requires no distributional assumptions but relies on the appropriateness of the quadratic loss function. The implementation of the method involves minimal computing outlay. Multivariate shrinkage is particularly effective when the area level means are highly correlated and the sample means of one or a few components have small sampling and between-area variances. Estimates for subpopulations based on small samples can be greatly improved by incorporating information from subpopulations with larger sample sizes.
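For readers unfamiliar with shrinkage estimation, the sketch below shows the univariate version referred to above: each area's sample mean is pulled towards the national mean, with a weight governed by its sampling variance relative to the between-area variance. All figures are invented, and the paper's multivariate estimator, which also borrows strength across subpopulations and outcome variables, is not shown.

import numpy as np

def shrink_area_means(area_means, sampling_vars, national_mean, between_var):
    area_means = np.asarray(area_means, dtype=float)
    sampling_vars = np.asarray(sampling_vars, dtype=float)
    w = sampling_vars / (sampling_vars + between_var)   # weight given to the national mean
    return (1.0 - w) * area_means + w * national_mean

# Invented economic-activity rates for four areas with very different sample sizes.
print(shrink_area_means(area_means=[0.55, 0.72, 0.40, 0.63],
                        sampling_vars=[0.0004, 0.0060, 0.0100, 0.0009],
                        national_mean=0.60, between_var=0.0025))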

13.
Under specified exogenous conditions, the fraction of identifiable records in a microdata file without positive identifiers such as name and address is estimated. The effect of possible noise in the data, as well as the fact that microdata files are samples, is taken into account. Using real microdata files, it is shown that there is no risk of disclosure if the information content of the characteristics known to the investigator (additional knowledge) is limited. Files combined with additional knowledge of large information content yield a high risk of disclosure. This risk can be eliminated only by massive modification of the data records, which, however, introduces large biases into complex statistical evaluations. In this case, the requirements of privacy protection and of high-quality data can perhaps be met only if the linkage of such files with extensive additional knowledge is prevented by appropriate organizational and legal restrictions.

14.
The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated with the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set, whose records are to be linked to those in the protected data set; the percent of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.
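A small simulation sketch of record-linkage-based risk assessment with shared variables: the intruder links each record of an external file to its nearest neighbour in a noise-masked release, and the proportion of correct links, which is known here because both files are generated from the same originals, serves as the empirical measure of disclosure risk. The masking method and noise levels are arbitrary choices made for this illustration.

import numpy as np

rng = np.random.default_rng(1)
original = rng.normal(size=(500, 4))                                  # confidential microdata
protected = original + rng.normal(scale=0.3, size=original.shape)     # noise-masked release
external = original[:100] + rng.normal(scale=0.1, size=(100, 4))      # intruder's external file

# Nearest-neighbour linkage of each external record to the protected file.
d2 = ((external[:, None, :] - protected[None, :, :]) ** 2).sum(axis=2)
linked = d2.argmin(axis=1)
risk = (linked == np.arange(100)).mean()
print(f"proportion of correctly linked records: {risk:.2f}")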

15.
Summary. Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure has at least two advantages. First, we suggest that it may be a more realistic measure of risk than two measures that are currently used with census data. Second, we show that consistent inference (in a specified sense) may be made about this measure from sample data without strong modelling assumptions. This is a surprising finding, in its contrast with the properties of the two 'similar' established measures. As a result, this measure has potentially useful applications to sample surveys. In addition to obtaining a simple consistent predictor of the measure, we propose a simple variance estimator and show that it is consistent. We also consider the extension of inference to allow for certain complex sampling schemes. We present a numerical study based on 1991 census data for about 450 000 enumerated individuals in one area of Great Britain. We show that the theoretical results on the properties of the point predictor of the measure of risk and its variance estimator hold to a good approximation for these data.
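As a rough illustration of the quantity being estimated, the sketch below computes the probability that a unique match is correct directly from a simulated population, on the argument that, under simple random sampling, a match on a key combination with sample frequency 1 and population frequency F is correct with probability 1/F. The key variables and population are invented, and the paper's sample-based predictor and its variance estimator are far more subtle than this brute-force calculation, which requires the full population file.

from collections import Counter
import random

random.seed(0)
# Hypothetical population of key-variable combinations (say age by sex by area code).
population_keys = [(random.randint(18, 90), random.choice("MF"), random.randint(1, 40))
                   for _ in range(50_000)]
sample_keys = random.sample(population_keys, 1_000)   # roughly a 2% sample

F = Counter(population_keys)   # population frequencies of each key combination
f = Counter(sample_keys)       # sample frequencies

sample_unique = [k for k, c in f.items() if c == 1]
theta = len(sample_unique) / sum(F[k] for k in sample_unique)
print(f"Pr(correct | unique match) is roughly {theta:.3f}")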

16.
Summary.  The one-number census approach was developed by the Office for National Statistics to adjust the counts from the 2001 census of England and Wales for underenumeration. The method is underpinned by an assumption of independence between the count of the population that was given by the 2001 census and the count that was given by the Census Coverage Survey. Some dependence was, however, detected, and the paper describes the strategy that was used to measure dependence and to adjust the 2001 census population estimates.
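The independence assumption can be seen in the dual-system (capture-recapture) estimator that underlies coverage adjustment of this kind, sketched below with invented counts: when people missed by the census also tend to be missed by the coverage survey, the matched count is inflated relative to what independence implies and the population estimate is biased downwards.

# Dual-system estimator: N_hat = n1 * n2 / m, with n1 the census count, n2 the coverage
# survey count and m the number of people counted in both. All numbers are invented.
N_true = 100_000
n1, n2 = 96_000, 92_000

m_indep = n1 * n2 / N_true                 # matched count expected under independence
print("independent capture:", round(n1 * n2 / m_indep))       # recovers 100000

m_dependent = m_indep * 1.02               # 2% more matches than independence implies
print("positive dependence:", round(n1 * n2 / m_dependent))   # underestimates the population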

17.
Small area statistics obtained from sample survey data provide a critical source of information used to study health, economic, and sociological trends. However, most large-scale sample surveys are not designed for the purpose of producing small area statistics. Moreover, data disseminators are prevented from releasing public-use microdata for small geographic areas for disclosure reasons, thus limiting the utility of the data they collect. This research evaluates a synthetic data method, intended for data disseminators, for releasing public-use microdata for small geographic areas based on complex sample survey data. The method replaces all observed survey values with synthetic (or imputed) values generated from a hierarchical Bayesian model that explicitly accounts for complex sample design features, including stratification, clustering, and sampling weights. The method is applied to restricted microdata from the National Health Interview Survey and synthetic data are generated for both sampled and non-sampled small areas. The analytic validity of the resulting small area inferences is assessed by direct comparison with the actual data, a simulation study, and a cross-validation study.

18.
When tables are generated from a data file, the release of those tables should not reveal overly detailed information concerning individual respondents. The disclosure of individual respondents in the microdata file can be prevented by applying disclosure control methods at the table level (by cell suppression or cell perturbation), but this may create inconsistencies among other tables based on the same data file. Alternatively, disclosure control methods can be applied at the microdata level, but these methods may change the data permanently and do not account for specific table properties. These problems can be circumvented by assigning a (single and fixed) weight factor to each respondent/record in the microdata file. Normally this weight factor is equal to 1 for each record, and is not explicitly incorporated in the microdata file. Upon tabulation, each contribution of a respondent is weighted multiplicatively by the respondent's weight factor. This approach is called Source Data Perturbation (SDP) because the data are perturbed at the microdata level, not at the table level. It should be noted, however, that the original microdata are not changed; only a weight variable is added. The weight factors can be chosen in accordance with the SDC paradigm, i.e. such that the tables generated from the microdata are safe and the information loss is minimized. The paper indicates how this can be done. Moreover, it is shown that the SDP approach is very suitable for use in data warehouses, as the weights can conveniently be put in the fact tables. The data can then still be accessed and sliced and diced up to a certain level of detail, and tables generated from the data warehouse are mutually consistent and safe.
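A toy Python/pandas sketch of the approach just described: every record carries a fixed weight factor, and every table cell is a weighted count, so tables drawn from the same file remain mutually consistent. The weights below are arbitrary; in practice they would be chosen, as the paper describes, so that the resulting tables are safe while information loss is minimized.

import pandas as pd

micro = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S", "N"],
    "industry": ["A", "B", "A", "A", "B", "A"],
    "sdp_weight": [1.0, 1.0, 0.8, 1.2, 1.0, 1.1],   # SDP weight factors, 1 by default
})

# Any tabulation sums the same weights, so different tables stay consistent with one another.
table = micro.pivot_table(index="region", columns="industry",
                          values="sdp_weight", aggfunc="sum", fill_value=0)
print(table)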

19.
陶然 (Tao Ran), 《统计研究》 (Statistical Research), 2014, 31(8): 66-72
From checking the accuracy of census data to measuring census error, research on assessing the quality of data from periodic censuses has developed and matured along with our understanding of the census data production process. Successive technical approaches have emerged, including demographic analysis, data consistency checks, auditing against administrative records and post-enumeration quality sampling, encompassing a variety of specific evaluation methods. According to the relationship between the evaluation benchmark and the census system itself, these methods can be grouped into three evaluation routes: internal evaluation, external evaluation and post-enumeration sampling. Analysing the characteristics and applicability of each method helps to deepen theoretical research on the quality assessment of China's periodic census data.

20.
Summary.  The 2001 census in the UK asked for a return of people 'usually living at this address'. But this phrase is fuzzy and may have led to undercount. In addition, analysis of the sex ratios in the 2001 census of England and Wales points to a sex bias in the adjustments for net undercount—too few males in relation to females. The Office for National Statistics's abandonment of the method of demographic analysis for the population of working ages has allowed these biases to creep in. The paper presents a demographic account to check on the plausibility of census results. The need to revise preliminary estimates of the national population over a period of years following census day—as experienced in North America and now in the UK—calls into question the feasibility of a one-number census. Looking to the future, the environment for taking a reliable census by conventional methods is deteriorating. The UK Government's proposals for a population register open up the possibility of a Nordic-style administrative record census in the longer term.
