Similar Articles (20 results)
1.
When tables are generated from a data file, the release of those tables should not reveal too detailed information concerning individual respondents. The disclosure of individual respondents in the microdata file can be prevented by applying disclosure control methods at the table level (by cell suppression or cell perturbation), but this may create inconsistencies among other tables based on the same data file. Alternatively, disclosure control methods can be applied at the microdata level, but these methods may change the data permanently and do not account for specific table properties. These problems can be circumvented by assigning a (single and fixed) weight factor to each respondent/record in the microdata file. Normally this weight factor is equal to 1 for each record, and is not explicitly incorporated in the microdata file. Upon tabulation, each contribution of a respondent is weighted multiplicatively by the respondent's weight factor. This approach is called Source Data Perturbation (SDP) because the data is perturbed at the microdata level, not at the table level. It should be noted, however, that the data in the original microdata is not changed; only a weight variable is added. The weight factors can be chosen in accordance with the SDC paradigm, i.e. such that the tables generated from the microdata are safe, and the information loss is minimized. The paper indicates how this can be done. Moreover it is shown that the SDP approach is very suitable for use in data warehouses, as the weights can be conveniently put in the fact tables. The data can then still be accessed and sliced and diced up to a certain level of detail, and tables generated from the data warehouse are mutually consistent and safe.
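The tabulation step described here is easy to sketch: each record carries a single fixed weight, and every table built from the microdata multiplies each contribution by that weight, so all released tables stay mutually consistent. A minimal sketch (field names and weight values are hypothetical):

```python
# Sketch of Source Data Perturbation (SDP) tabulation: each record carries a
# fixed weight, and every table generated from the microdata multiplies the
# record's contribution by that weight. Field names here are hypothetical.
from collections import defaultdict

def tabulate(records, by):
    """Weighted frequency table over the categorical variable `by`."""
    table = defaultdict(float)
    for r in records:
        table[r[by]] += r.get("weight", 1.0)  # weight defaults to 1
    return dict(table)

microdata = [
    {"region": "N", "sector": "A", "weight": 1.0},
    {"region": "N", "sector": "B", "weight": 0.5},  # perturbed weight
    {"region": "S", "sector": "A", "weight": 1.5},  # perturbed weight
    {"region": "S", "sector": "A", "weight": 1.0},
]

# Both tables below are derived from the same weights, so they are
# mutually consistent by construction.
print(tabulate(microdata, "region"))  # {'N': 1.5, 'S': 2.5}
print(tabulate(microdata, "sector"))  # {'A': 3.5, 'B': 0.5}
```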

2.
This paper is concerned with the maximum likelihood estimation and the likelihood ratio test for hierarchical loglinear models of multidimensional contingency tables with missing data. The problems of estimation and testing for a high dimensional contingency table can be reduced to those for a class of low dimensional tables. In some cases, the incomplete data in the high dimensional table become complete in the low dimensional tables; the reduction can then indicate how much the incomplete data contribute to the estimation and the test.

3.
Summary. We propose an approach for assessing the risk of individual identification in the release of categorical data. This requires the accurate calculation of predictive probabilities for those cells in a contingency table which have small sample frequencies, making the problem somewhat different from usual contingency table estimation, where interest is generally focused on regions of high probability. Our approach is Bayesian and provides posterior predictive probabilities of identification risk. By incorporating model uncertainty in our analysis, we can provide more realistic estimates of disclosure risk for individual cell counts than are provided by methods which ignore the multivariate structure of the data set.

4.
For square contingency tables with ordered categories, one may wish to carry out the analysis on collapsed tables in which some adjacent categories of the original table are combined. This paper proposes three new models which have the structure of point-symmetry (PS), quasi point-symmetry and marginal point-symmetry for collapsed square tables. It also gives a decomposition of the PS model for collapsed square tables. Father's and daughter's occupational mobility data are analyzed using the new models.

5.
Frequently, contingency tables are generated under multinomial sampling. Multinomial probabilities are then organized in a table assigning probabilities to each cell. A probability table can be viewed as an element in the simplex. The Aitchison geometry of the simplex identifies independent probability tables as a linear subspace. An important consequence is that, given a probability table, the nearest independent table is obtained by orthogonal projection onto the independent subspace. The nearest independent table is identified as that obtained by the product of geometric marginals, which do not coincide with the standard marginals, except in the independent case. The original probability table is decomposed into orthogonal tables, the independent and the interaction tables. The underlying model is log-linear, and a procedure to test independence of a contingency table, based on a multinomial simulation, is developed. Its performance is studied on an illustrative example.
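The projection described here can be sketched directly: the nearest independent table is the closed (re-normalised) outer product of the row and column geometric means, and the interaction table is the closed element-wise quotient of the original by its independent part. A minimal sketch with an illustrative 2×2 table:

```python
# Sketch of the orthogonal decomposition of a probability table in the
# Aitchison geometry: independent part = closed outer product of the
# geometric marginals; interaction part = closed element-wise quotient.
# The numbers below are illustrative only.
from math import prod

def closure(t):
    """Re-normalise a table so its entries sum to 1."""
    s = sum(sum(row) for row in t)
    return [[x / s for x in row] for row in t]

def geometric_marginals(p):
    rows = [prod(row) ** (1 / len(row)) for row in p]        # row geometric means
    cols = [prod(col) ** (1 / len(col)) for col in zip(*p)]  # column geometric means
    return rows, cols

def independent_part(p):
    rows, cols = geometric_marginals(p)
    return closure([[r * c for c in cols] for r in rows])

def interaction_part(p, ind):
    return closure([[p[i][j] / ind[i][j] for j in range(len(p[0]))]
                    for i in range(len(p))])

p = closure([[0.3, 0.2], [0.1, 0.4]])
ind = independent_part(p)
inter = interaction_part(p, ind)
# For an independent table, the interaction part is uniform (all cells equal).
```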

6.
For square contingency tables with ordered categories, one may wish to carry out the analysis on collapsed tables in which some adjacent categories of the original table are combined. This paper considers the symmetry model for collapsed square contingency tables and proposes a measure to represent the degree of departure from symmetry. The proposed measure is defined as the arithmetic mean of submeasures, each of which represents the degree of departure from symmetry for one collapsed 3×3 table. Each submeasure can also be expressed in terms of the power divergence or the diversity index for the corresponding collapsed table. Examples are given.
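The construction can be sketched as follows, with one caveat: the submeasure in the paper belongs to the power-divergence family, and the Kullback–Leibler member used below is only one representative, chosen for simplicity. Collapse the R×R table to 3×3 at each pair of cut points, score each collapsed table, and average:

```python
# Hedged sketch of a collapsed-table symmetry measure: collapse an R x R
# probability table to 3 x 3 at cut points (s, t), score each collapsed
# table with a KL-type departure-from-symmetry submeasure (normalised so
# that 0 = symmetry, 1 = complete asymmetry), and take the arithmetic mean.
from math import log

def collapse_3x3(p, s, t):
    """Combine row/column categories [0..s], [s+1..t], [t+1..R-1]."""
    bounds = [(0, s), (s + 1, t), (t + 1, len(p) - 1)]
    return [[sum(p[i][j] for i in range(a1, a2 + 1) for j in range(b1, b2 + 1))
             for (b1, b2) in bounds] for (a1, a2) in bounds]

def symmetry_departure(q):
    """KL-type submeasure on the off-diagonal cells, normalised by log 2."""
    off = sum(q[i][j] for i in range(3) for j in range(3) if i != j)
    total = 0.0
    for i in range(3):
        for j in range(3):
            if i != j and q[i][j] > 0:
                pij = q[i][j] / off
                mid = (q[i][j] + q[j][i]) / (2 * off)
                total += pij * log(pij / mid)
    return total / log(2)

def mean_measure(p):
    """Arithmetic mean of the submeasures over all collapsed 3 x 3 tables."""
    r = len(p)
    subs = [symmetry_departure(collapse_3x3(p, s, t))
            for s in range(r - 2) for t in range(s + 1, r - 1)]
    return sum(subs) / len(subs)
```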

7.
A data set in the form of a 2 × 2 × 2 contingency table is presented and analyzed in detail. For instructional purposes, the analysis of the data can be used to illustrate some basic concepts in the loglinear model approach to the analysis of multidimensional contingency tables.

8.
Compositional tables represent a continuous counterpart to well-known contingency tables. Their cells contain quantitatively expressed relative contributions of a whole, carrying exclusively relative information, and are popularly represented as proportions or percentages. The resulting factors, corresponding to rows and columns of the table, can be inspected much as with contingency tables, e.g. for mutually independent behaviour. The nature of compositional tables requires a specific geometrical treatment, represented by the Aitchison geometry on the simplex. The properties of the Aitchison geometry allow a decomposition of the original table into its independent and interactive parts. Moreover, the specific case of 2×2 compositional tables allows the construction of easily interpretable orthonormal coordinates (resulting from the isometric logratio transformation) for the original table and its decompositions. Consequently, for a sample of compositional tables both explorative statistical analysis, like graphical inspection of the independent and interactive parts, and statistical inference (odds-ratio-like testing of independence) can be performed. Theoretical advancements of the presented approach are demonstrated using two economic applications.

9.
Bayesian models for relative archaeological chronology building
For many years, archaeologists have postulated that the numbers of various artefact types found within excavated features should give insight about their relative dates of deposition even when stratigraphic information is not present. A typical data set used in such studies can be reported as a cross-classification table (often called an abundance matrix or, equivalently, a contingency table) of excavated features against artefact types. Each entry of the table represents the number of a particular artefact type found in a particular archaeological feature. Methodologies for attempting to identify temporal sequences on the basis of such data are commonly referred to as seriation techniques. Several different procedures for seriation including both parametric and non-parametric statistics have been used in an attempt to reconstruct relative chronological orders on the basis of such contingency tables. We develop some possible model-based approaches that might be used to aid in relative, archaeological chronology building. We use the recently developed Markov chain Monte Carlo method based on Langevin diffusions to fit some of the models proposed. Predictive Bayesian model choice techniques are then employed to ascertain which of the models that we develop are most plausible. We analyse two data sets taken from the literature on archaeological seriation.

10.
A two-way contingency table in which both variables have the same categories is termed a symmetric table. In many applications, because of the social processes involved, most of the observations lie on the main diagonal and the off-diagonal counts are small. For these tables, the model of independence is implausible and interest is then focussed on the off-diagonal cells and the models of quasi-independence and quasi-symmetry. For ordinal variables, a linear-by-linear association model can be used to model the interaction structure. For sparse tables, large-sample goodness-of-fit tests are often unreliable and one should use an exact test. In this paper, we review exact tests and the computing problems involved. We propose new recursive algorithms for exact goodness-of-fit tests of quasi-independence, quasi-symmetry, linear-by-linear association and some related models. We propose that all computations be carried out using symbolic computation and rational arithmetic in order to calculate the exact p-values accurately and describe how we implemented our proposals. Two examples are presented.
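The rational-arithmetic point can be illustrated on a familiar special case: with Python's fractions module, an exact conditional p-value is computed with no floating-point rounding at any step. The sketch below does this for Fisher's exact test on a 2×2 table; the paper's algorithms cover richer models (quasi-independence, quasi-symmetry), but the exact-arithmetic idea is the same:

```python
# Exact two-sided Fisher p-value for a 2 x 2 table, computed entirely in
# rational arithmetic (fractions.Fraction), so no precision is lost even
# for sparse or extreme tables.
from fractions import Fraction
from math import comb

def exact_p_value(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, c1)

    def prob(x):  # P(first cell = x) under the hypergeometric null
        return Fraction(comb(r1, x) * comb(r2, c1 - x), denom)

    p_obs = prob(a)
    probs = [prob(x) for x in range(max(0, c1 - r2), min(r1, c1) + 1)]
    return sum(q for q in probs if q <= p_obs)

p = exact_p_value(1, 9, 11, 3)
# p is an exact rational number; convert with float(p) only at the very end.
```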

11.
In this paper we introduce a new method for detecting outliers in a set of proportions. It is based on the construction of a suitable two-way contingency table and on the application of an algorithm for the detection of outlying cells in such a table. We exploit the special structure of the relevant contingency table to increase the efficiency of the method. The main properties of our algorithm, together with a guide for the choice of the parameters, are investigated through simulations, and in simple cases some theoretical justifications are provided. Several examples on synthetic data and an example based on pseudo-real data from biological experiments demonstrate the good performance of our algorithm.

12.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The article also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.

13.
For the analysis of 2 × 2 contingency tables with one set of fixed margins, a number of authors (e.g. Woolf, 1955; Cox, 1970) have proposed the use of various modified estimators based upon the empirical logistic transform. In this paper the moments of such estimators are considered and their small sample properties are investigated numerically.
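The modified estimators in question are typically of the "add one half" form: the empirical logistic transform replaces log(x/(n−x)) by log((x + 1/2)/(n − x + 1/2)), which stays finite even when x = 0 or x = n. A minimal sketch:

```python
# Empirical logistic transform with the 1/2 continuity correction, plus a
# common companion variance estimate. The log odds-ratio example at the
# bottom is illustrative.
from math import log

def empirical_logit(x, n):
    """Empirical logistic transform: finite even for x = 0 or x = n."""
    return log((x + 0.5) / (n - x + 0.5))

def approx_variance(x, n):
    """Common companion variance estimate for the empirical logit."""
    return 1.0 / (x + 0.5) + 1.0 / (n - x + 0.5)

# Log odds-ratio estimate for a 2 x 2 table with one fixed margin:
# the difference of the two empirical logits.
log_or = empirical_logit(3, 10) - empirical_logit(7, 10)
```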

14.
Trend tests for dose-response relationships are a central problem in medicine. The likelihood ratio test is often used to test hypotheses involving a stochastic order, and stratified contingency tables are common in practice, but the distribution theory of the likelihood ratio test has not been fully developed for stratified tables with more than two stochastically ordered distributions. For c strata of m × r tables, this article introduces a model-free method for testing conditional independence against the simple stochastic order alternative and gives the asymptotic distribution of the test statistic, which is a chi-bar-squared distribution. A real data set concerning an ordered stratified table is used to show the validity of this test method.

15.
Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, in order to achieve these conflicting objectives, statistical agencies apply statistical disclosure limitation (SDL) methods to the data, such as noise addition, swapping or microaggregation. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility, but also may increase disclosure risk.

In this paper, we describe a method for SDL that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with the proper, data-dependent covariance structure.
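The positivity-preserving step is easy to sketch: multiply each confidential value by a strictly positive noise factor with mean 1. The sketch below uses lognormal noise; the calibration that makes the masked data reproduce the original mean vector and covariance matrix exactly is the paper's contribution and is omitted here, so this is illustration only:

```python
# Sketch of multiplicative masking with mean-1 lognormal noise. Because
# the noise factor is always positive, positivity of the attribute is
# preserved. The data-dependent covariance calibration from the paper is
# NOT implemented here.
import random
import math

def multiplicative_noise(values, sigma=0.1, rng=None):
    """Mask positive values with independent mean-1 lognormal noise."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for v in values:
        # exp(N(-sigma^2/2, sigma^2)) has expectation exactly 1
        e = math.exp(rng.gauss(-sigma ** 2 / 2, sigma))
        out.append(v * e)
    return out

masked = multiplicative_noise([12.0, 3.5, 40.2, 0.7])
# Every masked value is still strictly positive.
```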

16.
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as that of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.
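The ideal perturbation described here can be sketched under a linear-Gaussian working model (an assumption made for the sketch, not required by the paper's argument): regress the confidential variable y on the non-confidential variable x, then replace each y_i with an independent draw from the estimated conditional distribution of y given x_i:

```python
# Sketch of conditional-distribution perturbation: fit y | x by ordinary
# least squares, then draw each perturbed value independently from the
# fitted normal conditional distribution. The linear-Gaussian model is a
# working assumption for illustration.
import random

def perturb_conditional(x, y, rng=None):
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                  # OLS slope
    a = my - b * mx                # OLS intercept
    resid_var = sum((yi - a - b * xi) ** 2
                    for xi, yi in zip(x, y)) / (n - 2)
    sd = resid_var ** 0.5
    # Independent realizations from the estimated distribution of y | x_i
    return [rng.gauss(a + b * xi, sd) for xi in x]
```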

17.
Recently Beh and Farver investigated and evaluated three non‐iterative procedures for estimating the linear‐by‐linear parameter of an ordinal log‐linear model. The study demonstrated that these non‐iterative techniques provide estimates that are, for most types of contingency tables, statistically indistinguishable from estimates from Newton's unidimensional algorithm. Here we show how two of these techniques are related using the Box–Cox transformation. We also show that by using this transformation, accurate non‐iterative estimates are achievable even when a contingency table contains sampling zeros.

18.
Testing for the difference in the strength of bivariate association in two independent contingency tables is an important issue that finds applications in various disciplines. Currently, many of the commonly used tests are based on single-index measures of association. More specifically, one obtains single-index measurements of association from two tables and compares them based on asymptotic theory. Although they are usually easy to understand and use, often much of the information contained in the data is lost with single-index measures. Accordingly, they fail to fully capture the association in the data. To remedy this shortcoming, we introduce a new summary statistic measuring various types of association in a contingency table. Based on this new summary statistic, we propose a likelihood ratio test comparing the strength of association in two independent contingency tables. The proposed test examines the stochastic order between summary statistics. We derive its asymptotic null distribution and demonstrate that the least favorable distributions are chi-bar-squared distributions. We numerically compare the power of the proposed test to that of the tests based on single-index measures. Finally, we provide two examples illustrating the new summary statistics and the related tests.

19.
This paper proposes a new model for square contingency tables. The proposed model tests the equality of the local odds ratios between cells on one side of the main diagonal and the corresponding cells on the other side, and it represents the non-symmetric structure of the square contingency table. The proposed model is compared with twenty-five models introduced for analysing square contingency tables with both symmetric and non-symmetric structures. The results show that the proposed model provides a better fit than the other existing models for square contingency tables.
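The local odds ratios that the model constrains are computable directly from the cell counts: theta[i][j] = (n[i][j]·n[i+1][j+1])/(n[i][j+1]·n[i+1][j]), and the model asserts theta[i][j] = theta[j][i] for pairs mirrored across the main diagonal. A minimal sketch (the empirical check below is illustrative, not the paper's fitting procedure):

```python
# Local odds ratios of a square table and their gaps across the main
# diagonal; under the proposed model the gaps are zero.
def local_odds_ratios(n):
    r = len(n)
    return [[(n[i][j] * n[i + 1][j + 1]) / (n[i][j + 1] * n[i + 1][j])
             for j in range(r - 1)] for i in range(r - 1)]

def mirrored_ratio_gaps(n):
    """Differences theta[i][j] - theta[j][i] for i < j (zero under the model)."""
    th = local_odds_ratios(n)
    r = len(th)
    return [th[i][j] - th[j][i] for i in range(r) for j in range(i + 1, r)]
```

A fully symmetric table satisfies the constraint trivially, since mirrored counts are equal; the model is weaker than symmetry because it only equates the mirrored local odds ratios, not the mirrored cells themselves.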

20.
When preparing data for public release, information organizations face the challenge of preserving the quality of data while protecting the confidentiality of both data subjects and sensitive data attributes. Without knowing what type of analyses will be conducted by data users, it is often hard to alter data without sacrificing data utility. In this paper, we propose a new approach to mitigate this difficulty, which entails using Bayesian additive regression trees (BART), in connection with existing methods for statistical disclosure limitation, to help preserve data utility while meeting confidentiality requirements. We illustrate the performance of our method through both simulation and a data example. The method works well when the targeted relationship underlying the original data is not weak, and the performance appears to be robust to the intensity of alteration.
