Similar documents (20 results)
1.
When preparing data for public release, information organizations face the challenge of preserving the quality of data while protecting the confidentiality of both data subjects and sensitive data attributes. Without knowing what type of analyses will be conducted by data users, it is often hard to alter data without sacrificing data utility. In this paper, we propose a new approach to mitigate this difficulty, which entails using Bayesian additive regression trees (BART), in connection with existing methods for statistical disclosure limitation, to help preserve data utility while meeting confidentiality requirements. We illustrate the performance of our method through both simulation and a data example. The method works well when the targeted relationship underlying the original data is not weak, and the performance appears to be robust to the intensity of alteration.
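The model-based synthesis idea in this abstract can be sketched in a few lines. BART itself is not in scikit-learn, so a gradient-boosted tree ensemble stands in for it below; the data, variable names and noise model are illustrative assumptions, not the paper's setup. The pattern is: fit a flexible model of the sensitive attribute on the non-sensitive ones, then release synthetic values drawn from the fitted predictive model.

```python
# Sketch of model-based synthesis in the spirit of the BART approach.
# GradientBoostingRegressor is a stand-in for BART; all variable names
# and parameter values are illustrative, not from the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                                   # non-sensitive attributes
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)   # sensitive attribute

model = GradientBoostingRegressor(random_state=0).fit(X, y)
resid_sd = np.std(y - model.predict(X))

# Replace the sensitive column with draws from the fitted predictive model:
# individual values change, but the relationship with X is preserved.
y_synth = model.predict(X) + rng.normal(scale=resid_sd, size=n)
```

A released file would pair `X` with `y_synth` instead of `y`, so that analyses of the relationship between the attributes remain approximately valid while no original sensitive value appears in the data.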

2.
The problem of limiting the disclosure of information gathered on a set of companies or individuals (the respondents) is considered, the aim being to provide useful information while preserving the confidentiality of sensitive information. The paper proposes a method which explicitly preserves certain information contained in the data. The data are assumed to consist of two sets of information on each respondent: public data and specific survey data. It is assumed in this paper that both sets of data are liable to be released for a subset of respondents. However, the public data will be altered in some way to preserve confidentiality, whereas the specific survey data are to be disclosed without alteration. The paper proposes a model-based approach to this problem, utilizing the information contained in the sufficient statistics obtained from fitting a model to the public data, conditioning on the survey data. Deterministic and stochastic variants of the method are considered.

3.
Summary. Statistical agencies that own different databases on overlapping subjects can benefit greatly from combining their data. These benefits are passed on to secondary data analysts when the combined data are disseminated to the public. Sometimes combining data across agencies or sharing these data with the public is not possible: one or both of these actions may break promises of confidentiality that have been given to data subjects. We describe an approach that is based on two stages of multiple imputation that facilitates data sharing and dissemination under restrictions of confidentiality. We present new inferential methods that properly account for the uncertainty that is caused by the two stages of imputation. We illustrate the approach by using artificial and genuine data.

4.
The Release of Statistical Microdata and Corresponding Disclosure-Limitation Methods
At present, most data-collecting institutions in China do not release the microdata they collect directly to the public. Because micro-level data cannot be used by outside researchers, this amounts to a waste of a social resource. We argue that data institutions can largely resolve this problem by applying appropriate methods to process microdata and then releasing the processed data. On the one hand, the vast majority of the information in the original data is preserved, which can satisfy the needs of different data users; on the other hand, the risk of disclosure is greatly reduced, which satisfies the institutions' confidentiality requirements. The main purpose of this paper is to introduce some microdata-processing methods in common use abroad, so as to provide a theoretical basis and some practical, workable procedures for domestic data institutions that wish to release microdata, and thereby to encourage greater attention from domestic data institutions and further research and innovation in this area from the statistical community.
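One of the standard masking methods the abstract alludes to is multiplicative random noise on numeric microdata columns. The sketch below uses an illustrative income variable and a 5% noise level, both assumptions for demonstration, to show why aggregate utility survives even though individual values are protected.

```python
# A minimal example of one common masking method: multiplicative random
# noise on a numeric microdata column. The variable and the 5% noise
# level are illustrative choices, not from the text.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=0.5, size=10_000)       # original values
noise = rng.normal(loc=1.0, scale=0.05, size=income.size)     # centered at 1
income_masked = income * noise                                # released values

# Aggregate statistics survive masking far better than individual values do.
rel_mean_err = abs(income_masked.mean() - income.mean()) / income.mean()
```

Every individual record is perturbed, yet the relative error in the released mean is a small fraction of a percent, which is the trade-off between utility and disclosure risk that the paper discusses.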

5.
Owing to the growing concerns over data confidentiality, many national statistical agencies are considering remote access servers to disseminate data to the public. With remote servers, users submit requests for output from statistical models fit using the collected data, but they are not allowed access to the data. Remote servers also should enable users to check the fit of their models; however, standard diagnostics like residuals or influence statistics can disclose individual data values. In this article, we present diagnostics for categorical data regressions that can be safely and usefully employed in remote servers. We illustrate the diagnostics with simulation studies.
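The flavor of a disclosure-safe diagnostic can be sketched with binned residuals: rather than returning individual residuals, a remote server returns residuals averaged within bins of the fitted probabilities, suppressing small cells. The binning rule and minimum cell size below are illustrative assumptions, not the paper's specific diagnostics.

```python
# Sketch of a disclosure-safe diagnostic for a categorical regression:
# the server releases only bin-averaged residuals, never individual ones.
# The 10-bin rule and the minimum cell size of 20 are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-1.5 * x[:, 0]))
y = rng.binomial(1, p)

fit = LogisticRegression().fit(x, y)
phat = fit.predict_proba(x)[:, 1]
resid = y - phat

# Average residuals within 10 quantile bins of phat; suppress tiny cells.
edges = np.quantile(phat, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(phat, edges[1:-1]), 0, 9)
binned = [resid[idx == b].mean() for b in range(10) if (idx == b).sum() >= 20]
```

For a well-specified model the binned residuals hover near zero, so the analyst can still judge fit, while no single record's residual is ever released.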

6.
The author examines the content and methodology of the 1989 Soviet census. Items to be included in the questionnaires for both the total population and the 25 percent sample survey are described. The author notes that questions on housing will be included for the first time. Other topics considered include reliability of data, confidentiality, and organization.

7.
National statistical agencies and other data custodians collect and hold a vast amount of survey and census data, containing information vital for research and policy analysis. However, the problem of allowing analysis of these data, while protecting respondent confidentiality, has proved challenging to address. In this paper we focus on the remote analysis approach, under which a confidential dataset is held in a secure environment under the direct control of the data custodian agency. A computer system within the secure environment accepts a query from an analyst, runs it on the data, then returns the results to the analyst. In particular, the analyst does not have direct access to the data at all, and cannot view any microdata records. We further focus on the fitting of linear regression models to confidential data in the presence of outliers and influential points, such as are often present in business data. We propose a new method for protecting confidentiality in linear regression via a remote analysis system that provides additional protection for outliers and influential points in the data. The method described in this paper was designed for the prototype DataAnalyser system developed by the Australian Bureau of Statistics; however, it would be suitable for similar remote analysis systems.
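The core concern, that a released regression can depend heavily on one influential record, can be illustrated with a standard influence measure. The sketch below flags high-influence points via Cook's distance and refits without them before releasing coefficients; the cutoff and refit rule are a conventional illustration, not the ABS DataAnalyser method.

```python
# Illustrative sketch: a remote server refits the regression without
# highly influential records before returning coefficients, so the
# output does not depend heavily on any single (possibly identifying)
# record. Cook's distance and the 4/n cutoff are standard, but this
# specific rule is not the paper's method.
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y[0] += 30.0                                   # plant a gross outlier

def cooks_distance(X, y):
    m, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # leverages
    e = y - X @ beta
    s2 = e @ e / (m - p)
    return e**2 * h / (p * s2 * (1 - h) ** 2)

d = cooks_distance(X, y)
keep = d < 4 / n                               # conventional cutoff
beta_safe, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
```

The planted outlier is flagged and excluded, and the released slope stays close to the true value of 2, which is the kind of robustness a remote analysis system needs before exposing regression output.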

8.
9.
For micro-datasets considered for release as scientific or public use files, statistical agencies face the dilemma of guaranteeing the confidentiality of survey respondents on the one hand and offering sufficiently detailed data on the other. For that reason, a variety of methods to guarantee disclosure control is discussed in the literature. In this paper, we present an application of Rubin's (J. Off. Stat. 9, 462–468, 1993) idea to generate synthetic datasets from existing confidential survey data for public release. We use a set of variables from the 1997 wave of the German IAB Establishment Panel and evaluate the quality of the approach by comparing the results of an analysis by Zwick (Ger. Econ. Rev. 6(2), 155–184, 2005) on the original data with the results we achieve for the same analysis run on the dataset after the imputation procedure. The comparison shows that valid inferences can be obtained using the synthetic datasets in this context, while confidentiality is guaranteed for the survey participants.

10.
The case for small area microdata
Summary. Census data are available in aggregate form for local areas and, through the samples of anonymized records (SARs), as samples of microdata for households and individuals. In 1991 there were two SAR files: a household file and an individual file. These have a high degree of detail on the census variables but little geographical detail, a situation that will be exacerbated for the 2001 SAR owing to the loss of district level geography on the individual SAR. The paper puts forward the case for an additional sample of microdata, also drawn from the census, that has much greater geographical detail. Small area microdata (SAM) are individual level records with local area identifiers and, to maintain confidentiality, reduced detail on the census variables. Population data from seven local authorities, including rural and urban areas, are used to define prototype samples of SAM. The rationale for SAM is given, with examples that demonstrate the role of local area information in the analysis of census data. Since there is a trade-off between the extent of local detail and the extent of detail on variables that can be made available, the confidentiality risk of SAM is assessed empirically. An indicative specification of the SAM is given, having taken into account the results of the confidentiality analysis.

11.
Data on operational risk loss events are generally scarce, which reduces the accuracy of model parameter estimates and in turn leads to biased economic capital allocation and weakened risk control. Within the framework of the loss distribution approach, this paper applies a Bayesian method based on MCMC simulation: using the WinBUGS package, stationary Markov chains for the negative binomial and Pareto distributions are constructed via Gibbs sampling, to dynamically simulate the posterior distributions of operational loss frequency and severity respectively, and the economic capital required for operational risk is then calculated. Compared with maximum likelihood estimation, the empirical results show that this method performs well under small-sample conditions.
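Once frequency and severity parameters are in hand, the loss-distribution step itself is plain Monte Carlo: simulate an annual loss count, sum that many severity draws, and read economic capital off the tail of the aggregate distribution. The parameter values below are illustrative assumptions, not posterior estimates from the paper, which obtains them via Gibbs sampling in WinBUGS.

```python
# Monte Carlo sketch of the loss distribution approach described in the
# abstract: negative binomial frequency, Pareto severity, capital from
# the 99.9% quantile of aggregate annual losses. All parameter values
# are illustrative assumptions, not the paper's estimates.
import numpy as np

rng = np.random.default_rng(4)
n_sims = 20_000

# Annual loss counts; numpy's pareto is the Lomax form, fine for a sketch.
freq = rng.negative_binomial(n=5, p=0.5, size=n_sims)
annual_loss = np.array([
    rng.pareto(a=3.0, size=k).sum() * 1e6 for k in freq   # sum of empty = 0
])

# Economic capital at the 99.9% level: tail quantile minus expected loss.
var_999 = np.quantile(annual_loss, 0.999)
capital = var_999 - annual_loss.mean()
```

The paper's contribution sits upstream of this step: with scarce loss data, Bayesian posterior draws of the frequency and severity parameters stabilize this computation where maximum likelihood estimates would be unreliable.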

12.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.
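The edit rules the abstract mentions are simple to state concretely: a balance equation requires components to sum to a reported total, and a linear inequality constrains ratios between fields. The field names and thresholds below are illustrative, not from the Census of Manufactures.

```python
# Minimal sketch of agency edit rules on establishment records: a balance
# equation (components sum to the total) and a linear inequality (a
# minimum payroll-per-employee ratio). Fields and thresholds illustrative.
import numpy as np

records = np.array([
    # total_emp, full_time, part_time, payroll
    [100, 80, 20, 5_000],
    [ 50, 45, 10, 2_000],   # violates the balance edit: 45 + 10 != 50
    [ 30, 25,  5,    10],   # violates the ratio edit: payroll too low
])

balance_ok = records[:, 1] + records[:, 2] == records[:, 0]
ratio_ok = records[:, 3] >= 20 * records[:, 0]   # payroll >= 20 per employee
faulty = ~(balance_ok & ratio_ok)
```

Edit-imputation then replaces the flagged values with plausible ones satisfying all rules; the article's contribution is doing that jointly with synthesis, so the released synthetic records also satisfy the edits.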

13.
Statistical Classification Methods in Consumer Credit Scoring: a Review
Credit scoring is the term used to describe formal statistical methods used for classifying applicants for credit into 'good' and 'bad' risk classes. Such methods have become increasingly important with the dramatic growth in consumer credit in recent years. A wide range of statistical methods has been applied, though the literature available to the public is limited for reasons of commercial confidentiality. Particular problems arising in the credit scoring context are examined and the statistical methods which have been applied are reviewed.

14.
Disseminating microdata to the public that provide a high level of data utility, while at the same time guaranteeing the confidentiality of survey respondents, is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential to enable the data-disseminating agency to achieve this twofold goal. So far, the approach has been successfully implemented only for a limited number of datasets in the U.S. In this paper, we present the first successful implementation outside the U.S.: the generation of partially synthetic datasets for an establishment panel survey at the German Institute for Employment Research. We describe the whole evolution of the project: from the early discussions concerning variables at risk to the final synthesis. We also present our disclosure risk evaluations and provide some first results on the data utility of the generated datasets. A variance-inflated imputation model is introduced that incorporates additional variability in the model for records that are not sufficiently protected by the standard synthesis.

15.
The European Federation of Statisticians in the Pharmaceutical Industry (EFSPI) believes access to clinical trial data should be implemented in a way that supports good research, avoids misuse of such data, lies within the scope of the original informed consent and fully protects patient confidentiality. In principle, EFSPI supports responsible data sharing. EFSPI acknowledges it is in the interest of patients that their data are handled in a strictly confidential manner to avoid misuse under all possible circumstances. It also honours the altruism of patients participating in trials that such data be used as fully as possible for the further development of science, applying good statistical principles. This paper summarises EFSPI's position on access to clinical trial data. The position was developed during the European Medicines Agency (EMA) advisory process and before the draft EMA policy on publication and access to clinical trial data was released for consultation; however, EFSPI's position remains unchanged following the release of the draft policy. Finally, EFSPI supports the need for further guidance on important technical aspects relating to re-analyses and additional analyses of clinical trial data, for example, multiplicity, meta-analysis, subgroup analyses and publication bias. Copyright © 2013 John Wiley & Sons, Ltd.

16.
Summary: One specific problem statistical offices and research institutes are faced with when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, and information loss is potentially high. In this paper an alternative technique of creating scientific-use files is discussed, which reproduces the characteristics of the original data quite well. It is based on Fienberg (1997, 1994) who estimates and resamples from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates data sets – the resample – which have the same characteristics as the original survey data. The paper includes some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and a comparison between resampling and a common method of disclosure control (disturbance with multiplicative error) with regard to confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be better reproduced by unweighted resampling. Parameter estimates can be reproduced quite well if the resampling procedure implements the correlation structure of the original data as a scale or if the data is multiplicatively perturbed and a correction term is used. On average, anonymization of data with multiplicatively perturbed values protects better against re-identification than the various resampling methods used.

17.
In May 2013, GlaxoSmithKline (980 Great West Road, Brentford, Middlesex, TW8 9GS, UK) established a new online system to enable scientific researchers to request access to anonymised patient level clinical trial data. Providing access to individual patient data collected in clinical trials enables conduct of further research that may help advance medical science or improve patient care. In turn, this helps ensure that the data provided by research participants are used to maximum effect in the creation of new knowledge and understanding. However, when providing access to individual patient data, maintaining the privacy and confidentiality of research participants is critical. This article describes the approach we have taken to prepare data for sharing with other researchers in a way that minimises risk with respect to the privacy and confidentiality of research participants, ensures compliance with current data privacy legal requirements and yet retains utility of the anonymised datasets for research purposes. We recognise that there are different possible approaches and that broad consensus is needed. Copyright © 2014 John Wiley & Sons, Ltd.

18.
When multiple data owners possess records on different subjects with the same set of attributes (known as horizontally partitioned data), the data owners can improve analyses by concatenating their databases. However, concatenation of data may be infeasible because of confidentiality concerns. In such settings, the data owners can use secure computation techniques to obtain the results of certain analyses on the integrated database without sharing individual records. We present secure computation protocols for Bayesian model averaging and model selection for both linear regression and probit regression. Using simulations based on genuine data, we illustrate the approach for probit regression, and show that it can provide reasonable model selection outputs.
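Secure computation protocols of this kind are built from primitives such as secure summation. The toy sketch below shows additive secret sharing among three data owners: each splits its private statistic into random shares, and only share sums are pooled, so the total is recovered without any owner revealing its own value. This is a generic illustration of the primitive, not the paper's protocol.

```python
# Toy secure-summation sketch via additive secret sharing: each owner's
# private statistic (e.g., its contribution to X'y) is split into random
# shares summing to it; only pooled column sums are ever exchanged.
# Illustrative only, not the protocol from the paper.
import numpy as np

rng = np.random.default_rng(5)
private_stats = [12.5, 40.0, 7.5]          # one private value per data owner
n_parties = len(private_stats)

# Each owner i splits its value into n_parties random shares.
shares = np.empty((n_parties, n_parties))
for i, v in enumerate(private_stats):
    r = rng.normal(size=n_parties - 1)
    shares[i] = np.append(r, v - r.sum())

# Owners exchange shares column-wise; only column sums are pooled.
secure_total = shares.sum(axis=0).sum()
```

Because sufficient statistics for linear and probit regression decompose into sums over records, repeating this primitive over the needed statistics lets the owners fit models on the integrated database without ever pooling microdata.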

19.
Statistical simulation in survey statistics is usually based on repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to explain issues regarding, e.g., sampling designs. Since the availability of real population data is in general very limited, it is necessary to generate synthetic data for such applications. The simulated data need to be as realistic as possible, while at the same time ensuring data confidentiality. This paper proposes a method for generating close-to-reality population data for complex household surveys. The procedure consists of four steps for setting up the household structure, simulating categorical variables, simulating continuous variables and splitting continuous variables into different components. It is not required to perform all four steps so that the framework is applicable to a broad class of surveys. In addition, the proposed method is evaluated in an application to the European Union Statistics on Income and Living Conditions (EU-SILC).

20.
Logistic regression is the most popular technique available for modeling dichotomous dependent variables. It has intensive application in the social, medical, behavioral and public health sciences. In this paper we propose a more efficient logistic regression analysis based on a moving extreme ranked set sampling (MERSSmin) scheme, with ranking based on an easily available auxiliary variable known to be associated with the variable of interest (the response variable). The paper demonstrates that this approach provides a more powerful testing procedure, as well as more efficient odds ratio and parameter estimates, than simple random sampling (SRS). Theoretical derivations and simulation studies are provided. Real data from the 2011 Youth Risk Behavior Surveillance System (YRBSS) are used to illustrate the procedures developed in this paper.
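The sampling step underlying this design can be sketched as follows: units are ranked on a cheap auxiliary variable, and from each successive set the unit with the minimum auxiliary value is selected for measurement of the response. The set-size schedule, the auxiliary relationship and the population below are illustrative assumptions; the paper's exact MERSSmin design may differ in detail.

```python
# Illustrative sketch of minimum-extreme ranked set sampling: from the
# j-th candidate set, keep the unit whose (cheap) auxiliary value is
# smallest, and measure the (expensive) binary response only for it.
# Set sizes and the auxiliary model are assumptions for this demo.
import numpy as np

rng = np.random.default_rng(6)

def merss_min(pop_aux, pop_y, m, rng):
    """Select m units: from the j-th set of size j, keep the aux-minimum."""
    chosen = []
    for j in range(1, m + 1):
        idx = rng.choice(pop_aux.size, size=j, replace=False)
        chosen.append(idx[np.argmin(pop_aux[idx])])
    return pop_y[np.array(chosen)]

aux = rng.normal(size=10_000)                                    # auxiliary variable
y = (rng.uniform(size=10_000) < 1 / (1 + np.exp(-aux))).astype(int)  # binary response
sample_y = merss_min(aux, y, m=10, rng=rng)
```

Because ranking uses only the auxiliary variable, the expensive response is measured for just the selected units, which is where the design's efficiency gain over simple random sampling comes from.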
