Similar Documents
20 similar documents found.
1.
In this paper we address the problem of protecting confidentiality in statistical tables containing sensitive information that cannot be disseminated. This is an issue of primary importance in practice. Cell Suppression is a widely-used technique for avoiding disclosure of sensitive information, which consists in suppressing all sensitive table entries along with a certain number of other entries, called complementary suppressions. Determining a pattern of complementary suppressions that minimizes the overall loss of information results in a difficult (i.e., NP-hard) optimization problem known as the Cell Suppression Problem. We propose here a different protection methodology consisting of replacing some table entries by appropriate intervals containing the actual value of the unpublished cells. We call this methodology Partial Cell Suppression, as opposed to the classical complete cell suppression. Partial cell suppression has the important advantage of reducing the overall information loss needed to protect the sensitive information. Also, the new method automatically provides auditing ranges for each unpublished cell, thus saving the statistical office an often time-consuming task while increasing the information explicitly provided with the table. Moreover, we propose an efficient (i.e., polynomial-time) algorithm to find an optimal partial suppression solution. A preliminary computational comparison between partial and complete suppression methodologies is reported, showing the advantages of the new approach. Finally, we address possible extensions leading to a unified complete/partial cell suppression framework.
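As a rough illustration of the auditing-range idea (not the authors' algorithm, which optimizes over all suppressions jointly), the following sketch computes the feasibility interval of a single suppressed cell in a small two-way table from its published margins using linear programming; the table values and the scipy-based LP are illustrative assumptions.

```python
# Sketch: audit interval [lo, hi] for one suppressed cell of a 2x2 table with
# published margins, assuming an attacker knows only the row/column totals and
# that entries are nonnegative. Illustrative only, not the authors' method.
import numpy as np
from scipy.optimize import linprog

# Table layout (row-major): cells x11, x12, x21, x22 with known margins.
row_totals = [70, 30]      # published row sums
col_totals = [40, 60]      # published column sums

# Equality constraints: each row and column must add to its published total.
A_eq = np.array([
    [1, 1, 0, 0],   # x11 + x12 = 70
    [0, 0, 1, 1],   # x21 + x22 = 30
    [1, 0, 1, 0],   # x11 + x21 = 40
    [0, 1, 0, 1],   # x12 + x22 = 60
])
b_eq = np.array(row_totals + col_totals, dtype=float)
bounds = [(0, None)] * 4    # nonnegative cell values

def audit_interval(cell_index):
    """Min and max feasible value of one cell given the published margins."""
    c = np.zeros(4)
    c[cell_index] = 1.0
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    return lo, hi

print(audit_interval(0))    # (10.0, 40.0): an interval that could be published for x11
```

In a partial-suppression release, such an interval (or a narrower one chosen by the optimization) would be published in place of the blanked cell.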

2.
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as those of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.
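A minimal numpy sketch of the conditional-distribution idea above, assuming joint normality of one confidential and one non-confidential variable; the normality assumption and variable names are purely illustrative.

```python
# Sketch: perturb a confidential variable Y by drawing, for each record, an
# independent value from the conditional distribution of Y given the
# non-confidential variable X, here under a joint-normality assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(50, 10, size=n)                  # non-confidential variable
y = 2.0 * x + rng.normal(0, 5, size=n)          # confidential variable

# Estimate the conditional distribution Y | X from the original data
# (under joint normality this reduces to a simple linear regression).
beta1, beta0 = np.polyfit(x, y, 1)
resid_sd = np.std(y - (beta0 + beta1 * x), ddof=2)

# Perturbed values: independent draws from the estimated conditional law.
y_perturbed = beta0 + beta1 * x + rng.normal(0, resid_sd, size=n)

# The perturbed data should reproduce the X-Y relationship of the original data.
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x, y_perturbed)[0, 1])
```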

3.
Summary. Top coding of extreme values of variables like income is a common method of statistical disclosure control, but it creates problems for the data analyst. The paper proposes two alternatives to top coding for statistical disclosure control that are based on multiple imputation. We show in simulation studies that the multiple-imputation methods provide better inferences from the publicly released data than top coding, using straightforward multiple-imputation methods of analysis, while maintaining good statistical disclosure control properties. We illustrate the methods on data from the 1995 Chinese household income project.
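A hedged sketch of the multiple-imputation alternative to top coding, assuming (purely for illustration) a Pareto model for the income tail; the paper's own imputation models may differ.

```python
# Sketch: instead of top coding incomes above a cutoff, replace them with
# multiple imputations drawn from a model fitted to the tail. A Pareto tail
# is assumed here only for illustration.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=1.0, size=5000)
cutoff = np.quantile(income, 0.97)              # would-be top-coding threshold
tail = income[income > cutoff]

# Maximum-likelihood Pareto shape estimate for the tail above the cutoff.
alpha_hat = len(tail) / np.sum(np.log(tail / cutoff))

# Create m completed data sets: values above the cutoff are re-drawn from the
# fitted tail, everything below the cutoff is released unchanged.
# (A fully "proper" MI would also draw the shape parameter from its posterior.)
m = 5
imputed_sets = []
for _ in range(m):
    z = income.copy()
    draws = cutoff * (1 - rng.uniform(size=tail.size)) ** (-1 / alpha_hat)
    z[income > cutoff] = draws
    imputed_sets.append(z)

# The analyst combines estimates across the m data sets with Rubin's rules.
print([round(z.mean(), 1) for z in imputed_sets])
```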

4.
5.
Summary. Protection against disclosure is important for statistical agencies releasing microdata files from sample surveys. Simple measures of disclosure risk can provide useful evidence to support decisions about release. We propose a new measure of disclosure risk: the probability that a unique match between a microdata record and a population unit is correct. We argue that this measure has at least two advantages. First, we suggest that it may be a more realistic measure of risk than two measures that are currently used with census data. Second, we show that consistent inference (in a specified sense) may be made about this measure from sample data without strong modelling assumptions. This is a surprising finding, in its contrast with the properties of the two 'similar' established measures. As a result, this measure has potentially useful applications to sample surveys. In addition to obtaining a simple consistent predictor of the measure, we propose a simple variance estimator and show that it is consistent. We also consider the extension of inference to allow for certain complex sampling schemes. We present a numerical study based on 1991 census data for about 450 000 enumerated individuals in one area of Great Britain. We show that the theoretical results on the properties of the point predictor of the measure of risk and its variance estimator hold to a good approximation for these data.
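A toy numerical illustration of a "probability that a unique match is correct" type of measure, computed here under the unrealistic assumption that population key counts are known; the paper's contribution is precisely that inference about its measure can be made from sample data alone, so this is not the paper's estimator.

```python
# Toy illustration only: with full population key counts F_k available, a
# sample-unique record matched to a randomly chosen population unit sharing
# its key value is correct with probability 1/F_k; averaging over sample
# uniques gives one naive match-correctness figure.
from collections import Counter
import random

random.seed(0)
population_keys = [random.choice(range(200)) for _ in range(10000)]  # key value per population unit
sample_keys = random.sample(population_keys, 500)                    # simple random sample

F = Counter(population_keys)          # population count of each key value
f = Counter(sample_keys)              # sample count of each key value

sample_uniques = [k for k, c in f.items() if c == 1]
risk = sum(1.0 / F[k] for k in sample_uniques) / len(sample_uniques)
print(f"naive per-record match-correctness probability: {risk:.3f}")
```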

6.
Since the 1920s and until recently, numerical computation has been a limiting factor in the application of statistical methods to the improvement of product quality. This restriction is being eliminated by the introduction of computer tools for statistical quality control.

Both novice and expert users of statistical methods can benefit substantially from the availability of integrated software for computing, graphics, and data management. The use of such software in SQC training programs enables the student to focus on the understanding of statistical techniques, rather than their mechanical details. In production environments, properly designed interfaces facilitate data entry and access to statistical software by plant personnel, without requiring knowledge of a computer language. These same tools can be used by management to retrieve information and obtain summaries and displays of critical data gathered over different periods of time. Finally, computer tools provide the applied statistician with a greater range of advanced methods, including analytical and graphical extensions of the traditional Shewhart control chart.

7.
The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated with the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set, whose records are to be linked to those in the protected data set; the percentage of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage—and thus disclosure—is still possible without shared variables.
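A small sketch of the conventional, distance-based record-linkage risk assessment reviewed above, with shared (noisy) variables between the external and protected files; the paper's further result on linkage without shared variables is not illustrated here, and the noise levels are arbitrary.

```python
# Sketch of conventional distance-based record linkage for disclosure risk
# assessment: each external record is linked to its nearest protected record,
# and the percentage of correct links is the empirical risk measure.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
original = rng.normal(size=(n, p))                       # true microdata
protected = original + rng.normal(0, 0.3, size=(n, p))   # masked release (noise addition)
external = original + rng.normal(0, 0.1, size=(n, p))    # intruder's external file

def standardize(a):
    """Center and scale each variable so distances are comparable."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

E, P = standardize(external), standardize(protected)
dists = np.linalg.norm(E[:, None, :] - P[None, :, :], axis=2)
links = dists.argmin(axis=1)                             # nearest protected record

percent_correct = np.mean(links == np.arange(n)) * 100
print(f"correctly linked: {percent_correct:.1f}%")
```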

8.
ABSTRACT

In a sequence of elements, a run is defined as a maximal subsequence of like elements. The number of runs or the length of the longest run has been widely used to test the randomness of an ordered sequence. Based on two different sampling methods and two types of test statistics, run tests can be classified into one of four cases. Numerous researchers have derived the probability distributions in many different ways, treating each case separately. In this paper, we propose a unified approach based on recurrence arguments over two mutually exclusive sub-sequences. We also consider sequences of nominal data with more than two classes. Thus, the traditional run tests for a binary sequence are special cases of our generalized run tests. We finally show that the generalized run tests can be applied to many quality management areas, such as testing changes in process variation, developing non-parametric multivariate control charts, and comparing the shapes and locations of more than two process distributions.
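A short sketch of the basic counting step behind such run tests, for nominal sequences with any number of classes; in an actual test the observed counts would then be compared with their null distribution.

```python
# Sketch: count runs (maximal subsequences of like elements) in a sequence of
# nominal data with any number of classes; the binary run test is the special
# case with two classes.
from itertools import groupby

def count_runs(seq):
    """Number of maximal runs of like elements in the sequence."""
    return sum(1 for _ in groupby(seq))

def longest_run(seq):
    """Length of the longest run of like elements."""
    return max(len(list(g)) for _, g in groupby(seq))

sample = ["A", "A", "B", "C", "C", "C", "A", "B", "B"]
print(count_runs(sample), longest_run(sample))   # 5 runs, longest run of length 3
```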

9.
10.
The problem of assessing the merit of a technique for displaying data is addressed. The divergence, a measure of the difference between two graphs, is introduced and its properties discussed. The performance of the divergence is investigated by Monte Carlo methods. Two applications of the divergence are presented.

11.
Because manufacturing lot sizes continue to shrink, statistical process control methods for short production runs are increasingly important. We review and comment on the assumptions, advantages, and disadvantages of the alternatives; traditional methods as well as more recent developments are described and contrasted.

12.
There is a growing demand for public use data while at the same time there are increasing concerns about the privacy of personal information. One proposed method for accomplishing both goals is to release data sets that do not contain real values but yield the same inferences as the actual data. The idea is to view confidential data as missing and use multiple imputation techniques to create synthetic data sets. In this article, we compare techniques for creating synthetic data sets in simple scenarios with a binary variable.
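A minimal sketch of one simple way to create fully synthetic data sets for a single binary variable, using posterior-predictive draws under a Beta(1, 1) prior; this is just one possible scheme of the kind the article compares, and the prior and sample sizes are illustrative assumptions.

```python
# Sketch: treat the confidential binary variable as missing for every record
# and create m synthetic data sets by drawing from its posterior predictive
# distribution under a Beta(1, 1) prior.
import numpy as np

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, size=400)        # original confidential binary variable
n, s = y.size, y.sum()

m = 10
synthetic_sets = []
for _ in range(m):
    theta = rng.beta(1 + s, 1 + n - s)                     # posterior draw of the proportion
    synthetic_sets.append(rng.binomial(1, theta, size=n))  # synthetic records

# Analysts combine the m synthetic-data estimates with the appropriate
# synthetic-data combining rules rather than Rubin's standard MI rules.
print([round(z.mean(), 3) for z in synthetic_sets])
```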

13.
Statistical process control tools have been used routinely to improve process capabilities through reliable on-line monitoring and diagnostic processes. In the present paper, we propose a novel multivariate control chart that integrates a support vector machine (SVM) algorithm, a bootstrap method, and a control chart technique to improve multivariate process monitoring. The proposed chart uses as its monitoring statistic the predicted probability of class (PoC) values from an SVM algorithm. The control limits of SVM-PoC charts are obtained by a bootstrap approach. A simulation study was conducted to evaluate the performance of the proposed SVM-PoC chart and to compare it with other data-mining-based control charts and Hotelling's T² control charts under various scenarios. The results showed that the proposed SVM-PoC charts outperformed other multivariate control charts in nonnormal situations. Further, we developed an exponentially weighted moving average version of the SVM-PoC charts for increased sensitivity to small shifts.
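A hedged sketch of an SVM-PoC-style monitoring scheme using scikit-learn; the kernel, the artificial out-of-control class used for training, and the bootstrap percentile are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch: an SVM is trained to separate reference in-control data from an
# artificially shifted out-of-control class; the predicted probability of the
# in-control class (PoC) is the monitoring statistic, and its lower control
# limit is set by bootstrapping the reference PoC values.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
p = 3
in_control = rng.normal(0, 1, size=(300, p))
shifted = rng.normal(1.0, 1, size=(300, p))            # pseudo out-of-control class

X = np.vstack([in_control, shifted])
labels = np.array([0] * 300 + [1] * 300)
svm = SVC(kernel="rbf", probability=True).fit(X, labels)

# PoC: predicted probability of the in-control class for reference data.
poc_ref = svm.predict_proba(in_control)[:, 0]

# Bootstrap lower control limit at a nominal false-alarm rate alpha.
alpha, B = 0.005, 2000
boot_quantiles = [np.quantile(rng.choice(poc_ref, size=poc_ref.size, replace=True), alpha)
                  for _ in range(B)]
lcl = np.mean(boot_quantiles)

# Monitor a new observation: signal if its PoC drops below the limit.
new_obs = rng.normal(0.8, 1, size=(1, p))
poc_new = svm.predict_proba(new_obs)[0, 0]
print(poc_new, poc_new < lcl)
```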

14.
In this article we review the major areas of remote sensing in the Russian literature for the period 1976 to 1985 that use statistical methods to analyze the observed data. For each of the areas, the problems that have been studied and the statistical techniques that have been used are briefly described.

15.
Using Markov chain representations, we evaluate and compare the performance of cumulative sum (CUSUM) and Shiryayev–Roberts methods in terms of the zero- and steady-state average run length and worst-case signal resistance measures. We also calculate the signal resistance values from the worst- to the best-case scenarios for both methods. Our results support the recommendation that Shewhart limits be used with CUSUM and Shiryayev–Roberts methods, especially for low values of the size of the shift in the process mean that the methods are designed to detect optimally.
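A simple simulation sketch of a one-sided CUSUM supplemented with a Shewhart limit, estimating the zero-state average run length by Monte Carlo rather than by the Markov chain approach used above; the design constants are illustrative, not the values studied in the paper.

```python
# Sketch: one-sided CUSUM for the mean with an added Shewhart limit on the
# individual observations, and a zero-state ARL estimated by simulation.
import numpy as np

def run_length(mean_shift, k=0.5, h=4.0, shewhart=3.5, max_n=10_000, rng=None):
    """Observations until the CUSUM exceeds h or a single point exceeds the Shewhart limit."""
    rng = rng or np.random.default_rng()
    c = 0.0
    for t in range(1, max_n + 1):
        x = rng.normal(mean_shift, 1.0)
        c = max(0.0, c + x - k)          # upper CUSUM recursion
        if c > h or x > shewhart:        # CUSUM signal or Shewhart signal
            return t
    return max_n

rng = np.random.default_rng(5)
arl_in_control = np.mean([run_length(0.0, rng=rng) for _ in range(2000)])
arl_small_shift = np.mean([run_length(0.5, rng=rng) for _ in range(2000)])
print(arl_in_control, arl_small_shift)
```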

16.
In this article, we compare alternative methods for imputing missing ordinal data in the framework of CUB (Combination of Uniform and (shifted) Binomial random variable) models. Various imputation methods are considered, as are univariate and multivariate approaches. The first step consists of running a simulation study, designed by varying the parameters of the CUB model, to consider and compare CUB models as well as other missing-data imputation methods. We then use real datasets on which to base the comparison between our approach and some general missing-data imputation methods under various missing-data mechanisms.
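As a building block for the kind of simulation study described above, the following sketch draws ordinal responses from a CUB model, a mixture of a shifted Binomial ("feeling") component and a discrete Uniform ("uncertainty") component; the parameter values are arbitrary illustrations.

```python
# Sketch: simulate ordinal responses on the scale 1..m from a CUB(pi, xi) model.
import numpy as np

def simulate_cub(n, m, pi, xi, rng=None):
    """Draw n ratings from the mixture pi * ShiftedBinomial + (1 - pi) * Uniform."""
    rng = rng or np.random.default_rng()
    from_binomial = rng.random(n) < pi                    # mixture component indicator
    shifted_binom = 1 + rng.binomial(m - 1, 1 - xi, size=n)
    uniform = rng.integers(1, m + 1, size=n)
    return np.where(from_binomial, shifted_binom, uniform)

rng = np.random.default_rng(6)
ratings = simulate_cub(2000, m=7, pi=0.8, xi=0.3, rng=rng)
print(np.bincount(ratings, minlength=8)[1:])              # frequencies of categories 1..7
```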

17.
The paper gives a review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years. After a brief introduction to the data model in general, the fundamental concepts of statistical data are introduced. These are called statistical objects because they are complex data structures (vectors, matrices, relations, time series, etc.) which may have different possible representations (e.g. tables, relations, vectors, pie-charts, bar-charts, graphs, and so on). For this reason a statistical object is defined by two different types of attribute (a summary attribute, with its own summary type and its own instances, called summary data, and the set of category attributes, which describe the summary attribute). Some conceptual models of statistical data (CSM, SDM4S), some semantic models of statistical data (SCM, SAM*, OSAM*), and some graphical models of statistical data (SUBJECT, GRASS, STORM) are also discussed.
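One possible, purely illustrative, in-code rendering of the "statistical object" idea described above, with a summary attribute described by a set of category attributes; the field names are not taken from any of the cited models.

```python
# Sketch: a statistical object = a summary attribute (with its summary type
# and summary data) plus the category attributes that describe it.
from dataclasses import dataclass, field

@dataclass
class CategoryAttribute:
    name: str                       # e.g. "year" or "region"
    categories: list                # the category instances

@dataclass
class StatisticalObject:
    summary_name: str               # e.g. "average income"
    summary_type: str               # e.g. "mean", "count", "total"
    category_attributes: list = field(default_factory=list)
    summary_data: dict = field(default_factory=dict)   # maps category tuples to summary values

obj = StatisticalObject(
    summary_name="average income",
    summary_type="mean",
    category_attributes=[CategoryAttribute("year", [2020, 2021]),
                         CategoryAttribute("region", ["North", "South"])],
    summary_data={(2020, "North"): 31200.0, (2020, "South"): 29800.0,
                  (2021, "North"): 32050.0, (2021, "South"): 30400.0},
)
print(obj.summary_data[(2021, "South")])
```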

18.
In the manufacturing process, a sequence of measurements of a quality characteristic is increasingly taken across some continuum, producing a curve that represents the quality of the item. This curve provides the so-called profile, or functional data. Whether the profile is linear or nonlinear, the common control-chart approaches are based on a multivariate control chart that monitors the estimated parameters of a pre-defined linear or nonlinear model. In practice, however, the model is usually difficult to specify, and it is also difficult to identify the abnormal pattern from an outlying parameter. The functional data control chart we propose provides a better solution to these problems. In Monte Carlo simulations, we show that the functional data control chart is sensitive to changes in the underlying process status. Applied to the vertical density profile data, the new method exhibits good performance.
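A simplified, distance-based profile-monitoring sketch in the spirit of a functional data control chart: it uses the mean squared deviation of each profile from a reference mean curve with an empirical control limit, which is a generic stand-in rather than the authors' exact chart.

```python
# Sketch: monitor profiles (curves on a common grid) by their mean squared
# deviation from a reference mean curve, with an empirical control limit
# estimated from in-control reference profiles.
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 100)                             # common measurement grid
reference = np.sin(2 * np.pi * t) + rng.normal(0, 0.1, size=(200, t.size))

mean_curve = reference.mean(axis=0)

def profile_statistic(profile):
    """Mean squared deviation of a profile from the reference mean curve."""
    return np.mean((profile - mean_curve) ** 2)

ref_stats = np.array([profile_statistic(p) for p in reference])
ucl = np.quantile(ref_stats, 0.995)                    # empirical upper control limit

new_profile = np.sin(2 * np.pi * t) + 0.15 * t + rng.normal(0, 0.1, size=t.size)
print(profile_statistic(new_profile) > ucl)            # signal if the curve shape has changed
```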

19.
Principal components are useful for multivariate process control. Typically, the principal component variables are selected to summarize the variation in the process data. We provide an analysis for selecting the principal component variables to be included in a multivariate control chart that incorporates the unique aspects of the process control problem (rather than using traditional principal component guidelines).
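A brief sketch of a Hotelling-type statistic computed on a selected subset of principal components; which components to retain is exactly the question the paper addresses, so the selection below is only a placeholder.

```python
# Sketch: T^2-type monitoring statistic on a chosen subset of principal
# components, with an empirical control limit from in-control reference data.
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))    # correlated in-control data

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)                     # eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = [0, 1, 2]                                           # placeholder component choice
scores = (X - mu) @ eigvecs[:, keep]
t2 = np.sum(scores ** 2 / eigvals[keep], axis=1)           # T^2 on the selected components

ucl = np.quantile(t2, 0.99)                                # empirical limit from reference data
new_x = mu + 2.0                                           # illustrative shifted observation
new_score = (new_x - mu) @ eigvecs[:, keep]
print(np.sum(new_score ** 2 / eigvals[keep]) > ucl)
```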

20.