Similar articles
1.
Owing to the growing concerns over data confidentiality, many national statistical agencies are considering remote access servers to disseminate data to the public. With remote servers, users submit requests for output from statistical models fit using the collected data, but they are not allowed access to the data. Remote servers also should enable users to check the fit of their models; however, standard diagnostics like residuals or influence statistics can disclose individual data values. In this article, we present diagnostics for categorical data regressions that can be safely and usefully employed in remote servers. We illustrate the diagnostics with simulation studies.

2.
Business establishment microdata typically are required to satisfy agency-specified edit rules, such as balance equations and linear inequalities. Inevitably some establishments' reported data violate the edit rules. Statistical agencies correct faulty values using a process known as edit-imputation. Business establishment data also must be heavily redacted before being shared with the public; indeed, confidentiality concerns lead many agencies not to share establishment microdata as unrestricted access files. When microdata must be heavily redacted, one approach is to create synthetic data, as done in the U.S. Longitudinal Business Database and the German IAB Establishment Panel. This article presents the first implementation of a fully integrated approach to edit-imputation and data synthesis. We illustrate the approach on data from the U.S. Census of Manufactures and present a variety of evaluations of the utility of the synthetic data. The paper also presents assessments of disclosure risks for several intruder attacks. We find that the synthetic data preserve important distributional features from the post-editing confidential microdata, and have low risks for the various attacks.

3.
To limit the risks of disclosures when releasing data to the public, it has been suggested that statistical agencies release multiply imputed, synthetic microdata. For example, the released microdata can be fully synthetic, comprising random samples of units from the sampling frame with simulated values of variables. Or, the released microdata can be partially synthetic, comprising the units originally surveyed with some collected values, e.g. sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple imputations. This article presents inferential methods for synthetic data for multi-component estimands, in particular procedures for Wald and likelihood ratio tests. The performance of the procedures is illustrated with simulation studies.

4.
This paper considers residuals for time series regression. Despite much literature on visual diagnostics for uncorrelated data, there is little on the autocorrelated case. To examine various aspects of the fitted time series regression model, three residuals are considered. The fitted regression model can be checked using orthogonal residuals; the time series error model can be analysed using marginal residuals; and the white noise error component can be tested using conditional residuals. When used together, these residuals allow identification of outliers, model mis‐specification and mean shifts. Due to the sensitivity of conditional residuals to model mis‐specification, it is suggested that the orthogonal and marginal residuals be examined first.

5.
This paper addresses the issue of when residuals from failure time models, which are useful in model validation and diagnostics, possess a conditional ancillarity property. This property states that the distribution of the residuals depends on the model parameters only through a many-to-one function of these parameters, which in certain models turns out to be the censoring proportion. Concrete results are obtained for models which possess an invariance structure, and these results are applied to commonly used failure time models. Aside from furthering our understanding of the distributional structure of residuals, this conditional ancillarity property can be exploited to study the distributional properties of residuals more efficiently, analytically and/or through numerical methods.

6.
Small area statistics obtained from sample survey data provide a critical source of information used to study health, economic, and sociological trends. However, most large-scale sample surveys are not designed for the purpose of producing small area statistics. Moreover, data disseminators are prevented from releasing public-use microdata for small geographic areas for disclosure reasons, thus limiting the utility of the data they collect. This research evaluates a synthetic data method, intended for data disseminators, for releasing public-use microdata for small geographic areas based on complex sample survey data. The method replaces all observed survey values with synthetic (or imputed) values generated from a hierarchical Bayesian model that explicitly accounts for complex sample design features, including stratification, clustering, and sampling weights. The method is applied to restricted microdata from the National Health Interview Survey and synthetic data are generated for both sampled and non-sampled small areas. The analytic validity of the resulting small area inferences is assessed by direct comparison with the actual data, a simulation study, and a cross-validation study.

7.
In this study, we develop the adjusted deviance residuals for the gamma regression model (GRM) by following Cordeiro's (2004) method. These adjusted deviance residuals under the GRM are used for influence diagnostics. A comparative analysis is carried out between our proposed method of the adjusted deviance residuals and an existing method for influence diagnostics. The results are illustrated by a simulation study and a real data set. They are presented for different values of dispersion and sample sizes and indicate the significant role of the GRM inferences.

8.
A general theory is presented for residuals from the general linear model with correlated errors. It is demonstrated that there are two fundamental types of residual associated with this model, referred to here as the marginal and the conditional residual. These measure respectively the distance to the global aspects of the model as represented by the expected value and the local aspects as represented by the conditional expected value. These residuals may be multivariate. Some important dualities are developed which have simple implications for diagnostics. The results are illustrated by reference to model diagnostics in time series and in classical multivariate analysis with independent cases.

9.
Recent advances in computing make it practical to use complex hierarchical models. However, the complexity makes it difficult to see how features of the data determine the fitted model. This paper describes an approach to diagnostics for hierarchical models, specifically linear hierarchical models with additive normal or t-errors. The key is to express hierarchical models in the form of ordinary linear models by adding artificial 'cases' to the data set corresponding to the higher levels of the hierarchy. The error term of this linear model is not homoscedastic, but its covariance structure is much simpler than that usually used in variance component or random effects models. The re-expression has several advantages. First, it is extremely general, covering dynamic linear models, random effect and mixed effect models, and pairwise difference models, among others. Second, it makes more explicit the geometry of hierarchical models, by analogy with the geometry of linear models. Third, the analogy with linear models provides a rich source of ideas for diagnostics for all the parts of hierarchical models. This paper gives diagnostics to examine candidate added variables, transformations, collinearity, case influence and residuals.

10.
In this paper we develop multiple case deletion statistics for the general linear model so that a residual vector and a leverage matrix are identified which have roles analogous to residuals and leverage for ordinary least squares models. We extend the notion of the conditional deletion diagnostic to general linear models. The residuals, leverage and deletion diagnostics are illustrated with data modelled by a linear growth curve.

11.
Statistical agencies manage huge amounts of microdata. The main task of these agencies is to provide a variety of users with general information about, for instance, the population and the economy. However, in some cases users request additional, more specific information. Many agencies have therefore set up facilities that enable selected users to obtain tailor-made statistical information. A remote access system is an example of such a facility, where users can submit queries for statistical information from their own computer. These queries are handled by the statistical agency and the generated, possibly confidentialised, output is returned to the user. This way the agency still keeps control over its own data while the user does not need to make frequent visits to the agency. For some years, the Luxembourg Income Study (LIS) and Luxembourg Employment Study (LES) have made use of an advanced remote access system. Statistics Netherlands and other statistical institutes have recently expressed the need for a similar system. In this article, we discuss the characteristics, limitations and desired properties of a remote access system. We illustrate the discussion with the system used at LIS/LES.

12.
In this paper we discuss methodology for the safe release of business microdata. In particular we extend the model-based protection procedure of Franconi and Stander (2002, The Statistician 51: 1–11) by allowing the model to take account of the spatial structure underlying the geographical information in the microdata. We discuss the use of the Gibbs sampler for performing the computations required by this spatial approach. We provide an empirical comparison of these non-spatial and spatial disclosure limitation methods based on the Italian sample from the Community Innovation Survey. We quantify the level of protection achieved for the released microdata and the error induced when various inferences are performed. We find that although the spatial method often induces higher inferential errors, it almost always provides more protection. Moreover the aggregated areas from the spatial procedure can be somewhat more spatially smooth, and hence possibly more meaningful, than those from the non-spatial approach. We discuss possible applications of these model-based protection procedures to more spatially extensive data sets.

13.
Constrained general linear models (CGLMs) have wide applications in practice. As in other data analyses, identifying influential observations that may be potential outliers is an important step in fitting CGLMs. We develop multiple case-deletion diagnostics for detecting influential observations in CGLMs. The diagnostics are functions of basic building blocks: studentized residuals, the error contrast matrix, and the inverse of the response variable covariance matrix. The basic building blocks are computed only once from the complete data analysis and provide information on the influence of the data on different aspects of the model fit. Computational formulas are given which make the procedures feasible. An illustrative example with a real data set is also reported.

14.
In this paper we discuss a new theoretical basis for perturbation methods. In developing this new theoretical basis, we define the ideal measures of data utility and disclosure risk. Maximum data utility is achieved when the statistical characteristics of the perturbed data are the same as those of the original data. Disclosure risk is minimized if providing users with microdata access does not result in any additional information. We show that when the perturbed values of the confidential variables are generated as independent realizations from the distribution of the confidential variables conditioned on the non-confidential variables, they satisfy the data utility and disclosure risk requirements. We also discuss the relationship between the theoretical basis and some commonly used methods for generating perturbed values of confidential numerical variables.

15.
The added variable plot is useful for examining the effect of a covariate in regression models. The plot provides information regarding the inclusion of a covariate, and is useful in identifying influential observations on the parameter estimates. Hall et al. (1996) proposed a plot for Cox's proportional hazards model derived by regarding the Cox model as a generalized linear model. This paper proves and discusses properties of this plot. These properties make the plot a valuable tool in model evaluation. Quantities considered include parameter estimates, residuals, leverage, case influence measures and correspondence to previously proposed residuals and diagnostics.

16.
To assess the influence of observations on the parameter estimates, case deletion diagnostics are commonly used in linear regression models. For linear models with correlated errors we study the influence of observations on testing a linear hypothesis using single and multiple case deletions. The changes in the likelihood ratio test and the F test are derived theoretically, and these changes are shown to be completely determined by two proposed generalized externally studentized residuals. An illustrative example with a real data set is also reported.

17.
This paper presents influence diagnostics for simultaneous equations models. It proposes residuals, leverage and other influence measures. A missing data method is adopted to minimize the masking effect due to case deletions. The assessment of local influence is also considered. The paper shows how to evaluate the effects that perturbations to the endogenous variables, predetermined variables and case weights may have on the parameter estimates. The diagnostics are illustrated with two examples.

18.
19.
A common approach to building control charts for autocorrelated data is to apply classical SPC to the residuals from a time series model of the process. However, Shewhart charts and even CUSUM charts are less sensitive to small shifts in the process mean when applied to residuals than when applied to independent data. Using an approximate analytical model, we show that the average run length of a CUSUM chart for residuals can be reduced substantially by modifying traditional chart design guidelines to account for the degree of autocorrelation in the data.

20.
Using a spectral approach, the authors propose tests to detect multivariate ARCH effects in the residuals from a multivariate regression model. The tests are based on a comparison, via a quadratic norm, between the uniform density and a kernel‐based spectral density estimator of the squared residuals and cross products of residuals. The proposed tests are consistent under an arbitrary fixed alternative. The authors present a new application of the test due to Hosking (1980) which is seen to be a special case of their approach involving the truncated uniform kernel. However, they typically obtain more powerful procedures when using a different weighting. The authors consider especially the procedure of Robinson (1991) for choosing the smoothing parameter of the spectral density estimator. They also introduce a generalized version of the test for ARCH effects due to Ling & Li (1997). They investigate the finite‐sample performance of their tests and compare them to existing tests including those of Ling & Li (1997) and the residual‐based diagnostics of Tse (2002). Finally, they present a financial application.
