Similar Documents
20 similar documents were retrieved.
1.
Summary. The study of human immunodeficiency virus dynamics is one of the most important areas in research into acquired immune deficiency syndrome in recent years. Non-linear mixed effects models have been proposed for modelling viral dynamic processes. A challenging problem in the modelling is to identify repeatedly measured (time-dependent), but possibly missing, immunologic or virologic markers (covariates) for viral dynamic parameters. For missing time-dependent covariates in non-linear mixed effects models, the commonly used complete-case, mean imputation and last value carried forward methods may give misleading results. We propose a three-step hierarchical multiple-imputation method, implemented by Gibbs sampling, which imputes the missing data at the individual level but can pool information across individuals. We compare various methods by Monte Carlo simulations and find that the multiple-imputation method proposed performs the best in terms of bias and mean-squared errors in the estimates of covariate coefficients. By applying the favoured multiple-imputation method to clinical data, we conclude that there is a negative correlation between the viral decay rate (a virological response parameter) and CD4 or CD8 cell counts during the treatment; this is counter-intuitive, but biologically interpretable on the basis of findings from other clinical studies. These results may have an important influence on decisions about treatment for acquired immune deficiency syndrome patients.
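The three-step hierarchical imputation via Gibbs sampling is specific to the non-linear mixed effects model, but any multiple-imputation analysis ends by pooling the per-imputation estimates of the covariate coefficients. A minimal sketch of that pooling step with Rubin's combining rules (the helper name and the numbers are illustrative, not taken from the paper):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine point estimates and variances from m imputed datasets with
    Rubin's rules: pooled estimate plus within-, between- and total variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                 # pooled point estimate
    u_bar = variances.mean()                 # average within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b          # total variance
    return q_bar, t

# illustrative numbers for a CD4-count coefficient estimated on m = 5 imputations
q_hat, t_var = pool_rubin([-0.31, -0.28, -0.35, -0.30, -0.33],
                          [0.012, 0.011, 0.013, 0.012, 0.012])
print(q_hat, np.sqrt(t_var))
```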

2.
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as those of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.
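The key construction is to replace each confidential value by an independent draw from the distribution of the confidential variables conditional on the non-confidential ones. A minimal sketch under an assumed joint normal model (the helper name, toy mean vector and covariance matrix are illustrative; the paper's result is not tied to normality):

```python
import numpy as np

def conditional_mvn_draw(y_nonconf, mu, Sigma, idx_c, idx_n, rng):
    """Draw perturbed confidential values from the conditional distribution of
    the confidential block given the observed non-confidential block, assuming
    a joint multivariate normal model with mean mu and covariance Sigma."""
    mu_c, mu_n = mu[idx_c], mu[idx_n]
    S_cc = Sigma[np.ix_(idx_c, idx_c)]
    S_cn = Sigma[np.ix_(idx_c, idx_n)]
    S_nn = Sigma[np.ix_(idx_n, idx_n)]
    A = S_cn @ np.linalg.inv(S_nn)
    cond_mean = mu_c + A @ (np.asarray(y_nonconf) - mu_n)
    cond_cov = S_cc - A @ S_cn.T
    return rng.multivariate_normal(cond_mean, cond_cov)

# toy example: variable 0 is confidential, variables 1-2 are non-confidential
rng = np.random.default_rng(0)
mu = np.array([50.0, 10.0, 5.0])
Sigma = np.array([[9.0, 3.0, 1.0],
                  [3.0, 4.0, 0.5],
                  [1.0, 0.5, 2.0]])
print(conditional_mvn_draw([11.0, 4.5], mu, Sigma, [0], [1, 2], rng))
```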

3.
Summary. The paper establishes a correspondence between statistical disclosure control and forensic statistics regarding their common use of the concept of 'probability of identification'. The paper then seeks to investigate what lessons for disclosure control can be learnt from the forensic identification literature. The main lesson that is considered is that disclosure risk assessment cannot, in general, ignore the search method that is employed by an intruder seeking to achieve disclosure. The effects of using several search methods are considered. Through consideration of the plausibility of assumptions and 'worst case' approaches, the paper suggests how the impact of search method can be handled. The paper focuses on foundations of disclosure risk assessment, providing some justification for some modelling assumptions underlying some existing record level measures of disclosure risk. The paper illustrates the effects of using various search methods in a numerical example based on microdata from a sample from the 2001 UK census.

4.
Three-mode analysis is a generalization of principal component analysis to three-mode data. While two-mode data consist of cases that are measured on several variables, three-mode data consist of cases that are measured on several variables at several occasions. As with any other statistical technique, the results of three-mode analysis may be influenced by missing data. Three-mode software packages generally use the expectation–maximization (EM) algorithm for dealing with missing data. However, there are situations in which the EM algorithm is expected to break down. Alternatively, multiple imputation may be used for dealing with missing data. In this study we investigated the influence of eight different multiple-imputation methods on the results of three-mode analysis, more specifically, a Tucker2 analysis, and compared the results with those of the EM algorithm. Results of the simulations show that multilevel imputation with the mode with the most levels nested within cases and the mode with the least levels represented as variables gives the best results for a Tucker2 analysis. Thus, this may be a good alternative to the EM algorithm in handling missing data in a Tucker2 analysis.

5.
Self-reported income information particularly suffers from an intentional coarsening of the data, which is called heaping or rounding. If it does not occur completely at random – which is usually the case – heaping and rounding have detrimental effects on the results of statistical analysis. Conventional statistical methods do not consider this kind of reporting bias, and thus might produce invalid inference. We describe a novel statistical modeling approach that allows us to deal with self-reported heaped income data in an adequate and flexible way. We suggest modeling heaping mechanisms and the true underlying model in combination. To describe the true net income distribution, we use the zero-inflated log-normal distribution. Heaping points are identified from the data by applying a heuristic procedure comparing a hypothetical income distribution and the empirical one. To determine heaping behavior, we employ two distinct models: either we assume piecewise constant heaping probabilities, or heaping probabilities are considered to increase steadily with proximity to a heaping point. We validate our approach by some examples. To illustrate the capacity of the proposed method, we conduct a case study using income data from the German National Educational Panel Study.
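As a rough illustration of the data-generating side of such a model, the sketch below simulates zero-inflated log-normal incomes and a simple reporting mechanism with a constant heaping probability and a single grid of heaping points; the paper's fitted models (piecewise constant or distance-dependent heaping probabilities, data-driven heaping points) are richer than this, and all parameter values here are made up:

```python
import numpy as np

def simulate_reported_income(n, p_zero=0.10, mu=7.5, sigma=0.6,
                             heap_grid=100.0, p_heap=0.4, rng=None):
    """Illustrative heaping mechanism (not the paper's fitted model): true net
    income is zero-inflated log-normal; with constant probability p_heap a
    positive income is reported rounded to the nearest multiple of heap_grid
    (a 'heaping point'), otherwise it is reported exactly."""
    rng = rng or np.random.default_rng()
    true = np.where(rng.random(n) < p_zero, 0.0, rng.lognormal(mu, sigma, n))
    heaped = rng.random(n) < p_heap
    reported = np.where(heaped & (true > 0),
                        np.round(true / heap_grid) * heap_grid, true)
    return true, reported

true_inc, reported_inc = simulate_reported_income(10_000, rng=np.random.default_rng(1))
# share of positive reports sitting exactly on a heaping point
on_heap = (reported_inc > 0) & (np.mod(reported_inc, 100.0) == 0)
print(on_heap.mean())
```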

6.
The article’s topic is logistic regression for direct data on the covariates, but indirect data on the endogenous variable. The indirect data may result from a privacy-protecting survey procedure for sensitive characteristics or from statistical disclosure control. Various procedures to generate the indirect data exist. However, we show that it is possible to develop a general approach for logistic regression analyses with indirect data that covers many procedures. We first derive a general algorithm for the maximum likelihood estimation and a general procedure for variance estimation. Subsequently, numerous examples demonstrate the broad applicability of our general framework.
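One simple instance of "indirect data" is a binary randomization scheme with known design probabilities, for which the observed-data likelihood can be maximized directly. The sketch below assumes P(y*=1 | y=1) = p11 and P(y*=1 | y=0) = p10 are known from the survey design; it is only a special case of the paper's general framework, and the variable names and data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_loglik(beta, X, y_star, p11, p10):
    """Negative log-likelihood when the binary outcome is observed only through
    a known randomization scheme: P(y*=1|x) = p11*pi(x) + p10*(1 - pi(x)),
    with pi(x) = expit(x'beta)."""
    pi = expit(X @ beta)
    p_obs = np.clip(p11 * pi + p10 * (1.0 - pi), 1e-12, 1 - 1e-12)
    return -np.sum(y_star * np.log(p_obs) + (1 - y_star) * np.log(1 - p_obs))

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y = (rng.random(2000) < expit(X @ np.array([0.5, -1.0]))).astype(int)
p11, p10 = 0.8, 0.2                                     # design probabilities
y_star = (rng.random(2000) < np.where(y == 1, p11, p10)).astype(int)
fit = minimize(neg_loglik, x0=np.zeros(2), args=(X, y_star, p11, p10))
print(fit.x)                                            # roughly recovers (0.5, -1.0)
```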

7.
Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, in order to achieve these conflicting objectives, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but also may increase disclosure risk. In this paper, we describe a method for SDL that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with the proper, data-dependent covariance structure.
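To illustrate the basic ingredient, the sketch below applies independent mean-one lognormal multiplicative noise, which keeps masked attributes positive and preserves the mean vector in expectation; the paper's actual method additionally chooses a data-dependent noise covariance so that the mean vector and covariance matrix are preserved exactly, which is not reproduced here:

```python
import numpy as np

def multiplicative_noise(X, cv=0.1, rng=None):
    """Mask positive attributes with independent lognormal noise of mean 1 and
    coefficient of variation cv: positivity is preserved and the mean vector is
    preserved in expectation. The exact moment-preserving, data-dependent noise
    covariance of the paper is NOT implemented here."""
    rng = rng or np.random.default_rng()
    sigma2 = np.log(1.0 + cv ** 2)                     # so that E[noise] = 1
    noise = rng.lognormal(mean=-sigma2 / 2.0, sigma=np.sqrt(sigma2), size=X.shape)
    return X * noise

rng = np.random.default_rng(3)
X = np.abs(rng.normal(100.0, 20.0, size=(1000, 3)))    # toy positive microdata
X_masked = multiplicative_noise(X, cv=0.1, rng=np.random.default_rng(4))
print(X.mean(axis=0))
print(X_masked.mean(axis=0))                           # close, not identical
```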

8.
Before releasing survey data, statistical agencies usually perturb the original data to keep each survey unit's information confidential. One significant concern in releasing survey microdata is identity disclosure, which occurs when an intruder correctly identifies the records of a survey unit by matching the values of some key (or pseudo-identifying) variables. We examine a recently developed post-randomization method for a strict control of identification risks in releasing survey microdata. While that procedure preserves the observed frequencies well, and hence statistical estimates in the case of simple random sampling, we show that in general surveys it may induce considerable bias in commonly used survey-weighted estimators. We propose a modified procedure that better preserves weighted estimates. The procedure is illustrated and empirically assessed with an application to a publicly available US Census Bureau data set.
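A generic post-randomization (PRAM) step looks like the sketch below: each observed category of a key variable is replaced according to a transition probability matrix. The matrix `P` here is an arbitrary illustration; the paper's contribution concerns how such a mechanism is calibrated to control identification risk and how survey-weighted estimates are preserved, which the sketch does not cover:

```python
import numpy as np

def pram(categories, P, rng=None):
    """Generic post-randomization (PRAM): category k of a key variable is
    replaced by category j with transition probability P[k, j]."""
    rng = rng or np.random.default_rng()
    categories = np.asarray(categories)
    return np.array([rng.choice(len(P), p=P[k]) for k in categories])

# toy 3-category key variable with an 80% chance of keeping the true category
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
x = np.random.default_rng(5).integers(0, 3, size=20)
print(x)
print(pram(x, P, rng=np.random.default_rng(6)))
```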

9.
Summary: One specific problem statistical offices and research institutes are faced with when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, and information loss is potentially high. In this paper an alternative technique of creating scientific-use files is discussed, which reproduces the characteristics of the original data quite well. It is based on Fienberg (1997, 1994) who estimates and resamples from the empirical multivariate cumulative distribution function of the data in order to get synthetic data. The procedure creates data sets – the resample – which have the same characteristics as the original survey data. The paper includes some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and a comparison between resampling and a common method of disclosure control (disturbance with multiplicative error) with regard to confidentiality on the one hand and the appropriateness of the disturbed data for different kinds of analyses on the other. The results show that univariate distributions can be better reproduced by unweighted resampling. Parameter estimates can be reproduced quite well if the resampling procedure implements the correlation structure of the original data as a scale or if the data is multiplicatively perturbed and a correction term is used. On average, anonymization of data with multiplicatively perturbed values protects better against re-identification than the various resampling methods used.
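A stripped-down stand-in for the resampling idea is sketched below: each variable is resampled from its empirical distribution (so univariate margins are reproduced), and an optional rank-reordering step re-imposes the rank correlation structure of the original data. This is only an illustrative approximation of the Fienberg-style procedure described in the paper; smoothing and weighting details are omitted:

```python
import numpy as np

def resample_synthetic(data, keep_rank_correlation=True, rng=None):
    """Variable-wise resampling from the empirical distributions. If requested,
    each resampled column is reordered to follow the rank order of the original
    column, which restores the rank correlation structure of the data."""
    rng = rng or np.random.default_rng()
    n, k = data.shape
    synth = np.column_stack([rng.choice(data[:, j], size=n, replace=True)
                             for j in range(k)])
    if keep_rank_correlation:
        ranks = np.argsort(np.argsort(data, axis=0), axis=0)
        synth = np.take_along_axis(np.sort(synth, axis=0), ranks, axis=0)
    return synth

rng = np.random.default_rng(12)
original = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
synthetic = resample_synthetic(original, rng=rng)
print(np.corrcoef(original.T)[0, 1], np.corrcoef(synthetic.T)[0, 1])
```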

10.
Summary. Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure has at least two advantages. First, we suggest that it may be a more realistic measure of risk than two measures that are currently used with census data. Second, we show that consistent inference (in a specified sense) may be made about this measure from sample data without strong modelling assumptions. This is a surprising finding, in its contrast with the properties of the two 'similar' established measures. As a result, this measure has potentially useful applications to sample surveys. In addition to obtaining a simple consistent predictor of the measure, we propose a simple variance estimator and show that it is consistent. We also consider the extension of inference to allow for certain complex sampling schemes. We present a numerical study based on 1991 census data for about 450 000 enumerated individuals in one area of Great Britain. We show that the theoretical results on the properties of the point predictor of the measure of risk and its variance estimator hold to a good approximation for these data.

11.
This paper describes data-swapping as an approach to disclosure control for statistical databases. Data-swapping is a data transformation technique where the underlying statistics of the data are preserved. It can be used as a basis for microdata release or to justify the release of tabulations.
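A minimal data-swapping sketch, assuming a single numeric attribute and purely random pairing (practical swaps typically match the pairs on control variables so that selected tabulations are preserved); all names and data are illustrative:

```python
import numpy as np

def random_swap(values, frac=0.2, rng=None):
    """Minimal data swap: a random fraction of records is paired and the paired
    records exchange their value of the chosen attribute, so the attribute's
    univariate statistics are preserved exactly (same multiset of values)."""
    rng = rng or np.random.default_rng()
    out = np.array(values, copy=True)
    n_pairs = int(len(out) * frac / 2)
    chosen = rng.choice(len(out), size=2 * n_pairs, replace=False)
    a, b = chosen[:n_pairs], chosen[n_pairs:]
    out[a], out[b] = out[b].copy(), out[a].copy()
    return out

income = np.random.default_rng(7).lognormal(10.0, 0.5, size=1000)
swapped = random_swap(income)
print(np.allclose(np.sort(income), np.sort(swapped)))   # True: same values, reassigned
```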

12.
In the area of statistical disclosure limitation, releasing synthetic data sets has become a popular method for limiting the risks of disclosure of sensitive information and at the same time maintaining analytic utility of the data. However, less work has been done on how to create synthetic contingency tables that preserve some summary statistics of the original table. Studies in this area have primarily focused on generating replacement tables that preserve the margins of the original table since the latter support statistical inferences for a large set of parametric tests and models. Yet, not all synthetic tables that preserve a set of margins yield consistent results. In this paper, we propose alternative synthetic table releases. We describe how to generate complete two-way contingency tables that have the same set of observed conditional frequencies by using tools from computational algebra. We study both the disclosure risk and the data utility associated with such synthetic tabular data releases, and compare them to the traditionally released synthetic tables.

13.
This article describes the effects on estimates of the size distribution of family-unit money income produced by adjusting CPS estimates for 1972 by adding several other data sources. Income estimates were adjusted on an individual-observation basis to make them consistent with independent control totals. As a result of these adjustments, mean income for all units rose 12 percent. The relative share of the top 5 percent increased substantially. Property income increased and wage income decreased in relative importance. The adjustment to mean income was largest for the oldest age group and smallest for the youngest age group.

14.
胡宗义, 李毅. 《统计研究》 (Statistical Research), 2020, 37(4): 59-74.
Using the formal implementation of China's environmental information disclosure system in 2008 as an exogenous shock, this paper constructs a quasi-natural experiment and, based on panel data for 285 Chinese cities from 2004 to 2017, systematically evaluates the effect of environmental information disclosure on industrial pollutant emissions with a difference-in-differences design. The approach overcomes the measurement difficulties and endogeneity problems associated with environmental information disclosure, provides the first assessment of its emission-reduction effect, and uses a mathematical model to formalize the underlying mechanism. The results show that environmental information disclosure significantly reduces industrial pollutant emissions, and that the effect is both lagged and long-lasting; moreover, the emission-reduction effect increases with the region's level of environmental pollution and the stringency of environmental regulation. Mechanism analysis indicates that the effect is transmitted mainly through industrial structure transformation and progress in abatement technology. To verify the robustness of the conclusions, the paper reports parallel-trend, instrumental-variable, and placebo tests, among others. Empirically, the study enriches the discussion of the relationship between environmental information disclosure and environmental pollution control, and offers useful policy implications for improving China's environmental governance and winning the battle against pollution.
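The identification strategy is a difference-in-differences comparison of treated and control cities before and after the 2008 disclosure rule. The sketch below shows only the canonical 2x2 DID contrast on simulated data; the paper estimates a richer two-way fixed-effects specification on the full 285-city panel with controls and robustness checks, and the simulated effect size is made up:

```python
import numpy as np
import pandas as pd

def did_2x2(df, outcome, treated, post):
    """Canonical 2x2 difference-in-differences contrast:
    (treated post - treated pre) - (control post - control pre)."""
    m = df.groupby([treated, post])[outcome].mean()
    return (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

# toy data: the disclosure rule lowers emissions of treated cities by 0.3 after 2008
rng = np.random.default_rng(8)
n = 4000
df = pd.DataFrame({"treated": rng.integers(0, 2, n), "post": rng.integers(0, 2, n)})
df["emissions"] = (1.0 + 0.2 * df["treated"] + 0.1 * df["post"]
                   - 0.3 * df["treated"] * df["post"] + rng.normal(0.0, 0.1, n))
print(did_2x2(df, "emissions", "treated", "post"))       # approximately -0.3
```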

15.
Disseminating microdata to the public that provide a high level of data utility while at the same time guaranteeing the confidentiality of survey respondents is a difficult task. Generating multiply imputed synthetic datasets is an innovative statistical disclosure limitation technique with the potential of enabling the data disseminating agency to achieve this twofold goal. So far, the approach has been successfully implemented only for a limited number of datasets in the U.S. In this paper, we present the first successful implementation outside the U.S.: the generation of partially synthetic datasets for an establishment panel survey at the German Institute for Employment Research. We describe the whole evolution of the project: from the early discussions concerning variables at risk to the final synthesis. We also present our disclosure risk evaluations and provide some first results on the data utility of the generated datasets. A variance-inflated imputation model is introduced that incorporates additional variability in the model for records that are not sufficiently protected by the standard synthesis.

16.
Researchers have been developing various extensions and modified forms of the Weibull distribution to enhance its capability for modeling and fitting different data sets. In this note, we investigate the potential usefulness of the new modification to the standard Weibull distribution called the odd Weibull distribution in income economic inequality studies. Some mathematical and statistical properties of this model are proposed. We obtain explicit expressions for the first incomplete moment, quantile function, Lorenz and Zenga curves and related inequality indices. In addition to the well-known stochastic order based on the Lorenz curve, the stochastic order based on the Zenga curve is considered. Since the new generalized Weibull distribution seems to be suitable to model wealth, financial, actuarial and especially income distributions, these findings are fundamental in understanding how parameter values are related to inequality. Also, the estimation of parameters by maximum likelihood and moment methods is discussed. Finally, this distribution has been fitted to United States and Austrian income data sets and has been found to fit remarkably well compared with other widely used income models.
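The closed-form Lorenz and Zenga expressions are specific to the odd Weibull model; their distribution-free sample counterparts, against which a fitted model would be compared, can be computed as in the sketch below (the lognormal sample is purely illustrative, not one of the paper's data sets):

```python
import numpy as np

def lorenz_points(income):
    """Empirical Lorenz curve ordinates L(k/n) for a positive income sample."""
    x = np.sort(np.asarray(income, dtype=float))
    return np.insert(np.cumsum(x) / x.sum(), 0, 0.0)

def gini(income):
    """Sample Gini index via the standard rank-based formula."""
    x = np.sort(np.asarray(income, dtype=float))
    n = len(x)
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1.0) / n

sample = np.random.default_rng(9).lognormal(10.0, 0.8, size=50_000)
L = lorenz_points(sample)
print(L[len(L) // 2])          # share of total income held by the poorer half
print(gini(sample))            # for lognormal with sigma = 0.8, Gini is about 0.43
```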

17.
The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated with the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set, whose records are to be linked to those in the protected data set; the percentage of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.
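Conventional distance-based record linkage on shared variables can be sketched as below: each external record is matched to its nearest protected record, and the share of correct matches is read off as the empirical re-identification risk. Noise addition is used here only as a stand-in masking method, and the paper's point about linkage without shared variables is not shown:

```python
import numpy as np

def linkage_risk(external, protected):
    """Distance-based record linkage: each external record is linked to its
    nearest protected record (Euclidean distance on standardized variables).
    With both files in the same record order, the share of i -> i links is the
    empirical re-identification rate."""
    mu, sd = external.mean(axis=0), external.std(axis=0)
    E, P = (external - mu) / sd, (protected - mu) / sd
    d = np.linalg.norm(E[:, None, :] - P[None, :, :], axis=2)   # n x n distances
    return (d.argmin(axis=1) == np.arange(len(E))).mean()

rng = np.random.default_rng(10)
original = rng.normal(size=(500, 3))
masked = original + rng.normal(scale=0.3, size=original.shape)  # noise addition
print(linkage_risk(original, masked))    # proportion of records correctly re-identified
```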

18.
Summary. We apply multivariate shrinkage to estimate local area rates of unemployment and economic inactivity by using UK Labour Force Survey data. The method exploits the similarity of the rates of claiming unemployment benefit and the unemployment rates as defined by the International Labour Organisation. This is done without any distributional assumptions, merely relying on the high correlation of the two rates. The estimation is integrated with a multiple-imputation procedure for missing employment status of subjects in the database (item non-response). The hot deck method that is used in the imputations is adapted to reflect the uncertainty in the model for non-response. The method is motivated as a development (improvement) of the current operational procedure in which the imputed value is a non-stochastic function of the data. An extension of the procedure to subjects who are absent from the database (unit non-response) is proposed.

19.
For micro-datasets considered for release as scientific or public use files, statistical agencies have to face the dilemma of guaranteeing the confidentiality of survey respondents on the one hand and offering sufficiently detailed data on the other hand. For that reason, a variety of methods to guarantee disclosure control is discussed in the literature. In this paper, we present an application of Rubin’s (J. Off. Stat. 9, 462–468, 1993) idea to generate synthetic datasets from existing confidential survey data for public release. We use a set of variables from the 1997 wave of the German IAB Establishment Panel and evaluate the quality of the approach by comparing results from an analysis by Zwick (Ger. Econ. Rev. 6(2), 155–184, 2005) with the original data with the results we achieve for the same analysis run on the dataset after the imputation procedure. The comparison shows that valid inferences can be obtained using the synthetic datasets in this context, while confidentiality is guaranteed for the survey participants.

20.
In order to guarantee confidentiality and privacy of firm-level data, statistical offices apply various disclosure limitation techniques. However, each anonymization technique has its protection limits such that the probability of disclosing the individual information for some observations is not minimized. To overcome this problem, we propose combining two separate disclosure limitation techniques, blanking and multiplication of independent noise, in order to protect the original dataset. The proposed approach yields a decrease in the probability of reidentifying/disclosing individual information and can be applied to linear and nonlinear regression models. We show how to combine the blanking method with the multiplicative measurement error method and how to estimate the model by combining the multiplicative Simulation-Extrapolation (M-SIMEX) approach from Nolte (2007) on the one side with the Inverse Probability Weighting (IPW) approach going back to Horvitz and Thompson (J. Am. Stat. Assoc. 47:663–685, 1952) and on the other side with matching methods, as an alternative to IPW, like the semiparametric M-Estimator proposed by Flossmann (2007). Based on Monte Carlo simulations, we show that multiplicative measurement error combined with blanking as a masking procedure does not necessarily lead to a severe reduction in the estimation quality, provided that its effects on the data generating process are known.
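The IPW building block can be sketched as below for a blanking mechanism whose observation probabilities are taken as known (Horvitz-Thompson-style weighting); the paper's estimators additionally estimate these probabilities and correct for the multiplicative measurement error via M-SIMEX, neither of which is shown, and all data here are simulated for illustration:

```python
import numpy as np

def ipw_mean(y, observed, p_obs):
    """Inverse-probability-weighted mean: non-blanked values are weighted by the
    inverse of their (here: known) probability of being observed, in the spirit
    of Horvitz-Thompson weighting."""
    w = observed / p_obs
    return np.sum(w * np.where(observed == 1, y, 0.0)) / np.sum(w)

rng = np.random.default_rng(11)
x = rng.normal(size=20_000)
y = 2.0 + 1.5 * x + rng.normal(size=x.size)
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))                 # blanking depends on x
observed = (rng.random(x.size) < p_obs).astype(int)
print(y.mean(), y[observed == 1].mean(), ipw_mean(y, observed, p_obs))
# the naive complete-case mean is biased upwards; the IPW mean recovers the full mean
```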

