Similar Literature (20 results)
1.
Small area statistics obtained from sample survey data provide a critical source of information used to study health, economic, and sociological trends. However, most large-scale sample surveys are not designed for the purpose of producing small area statistics. Moreover, data disseminators are prevented from releasing public-use microdata for small geographic areas for disclosure reasons, thus limiting the utility of the data they collect. This research evaluates a synthetic data method, intended for data disseminators, for releasing public-use microdata for small geographic areas based on complex sample survey data. The method replaces all observed survey values with synthetic (or imputed) values generated from a hierarchical Bayesian model that explicitly accounts for complex sample design features, including stratification, clustering, and sampling weights. The method is applied to restricted microdata from the National Health Interview Survey, and synthetic data are generated for both sampled and non-sampled small areas. The analytic validity of the resulting small area inferences is assessed by direct comparison with the actual data, a simulation study, and a cross-validation study.
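As a point of orientation (this equation is not quoted from the paper, but it is the standard synthetic-data mechanism the abstract describes): for each small area a, observed values are replaced by draws from the posterior predictive distribution of the fitted hierarchical model,

```latex
\tilde{y}_{a} \sim f(\tilde{y} \mid \text{data}) = \int f(\tilde{y} \mid \theta_a)\, p(\theta_a \mid \text{data})\, d\theta_a ,
```

where θ_a collects the area-level parameters. Repeating the draws m times yields m synthetic data sets whose between- and within-set variability can be combined with standard synthetic-data inference rules.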

2.
Complex survey sampling is often used to sample a fraction of a large finite population. In general, the survey is conducted so that each unit (e.g. subject) in the sample has a different probability of being selected into the sample. For generalizability of the sample to the population, both the design and the probability of being selected into the sample must be incorporated in the analysis. In this paper we focus on non-standard regression models for complex survey data. In our motivating example, which is based on data from the Medical Expenditure Panel Survey, the outcome variable is the subject's 'total health care expenditures in the year 2002'. Previous analyses of medical cost data suggest that the variance is approximately equal to the mean raised to the power of 1.5, which is a non-standard variance function. Currently, the regression parameters for this model cannot be easily estimated in standard statistical software packages. We propose a simple two-step method to obtain consistent regression parameter and variance estimates; the method proposed can be implemented within any standard sample survey package. The approach is applicable to complex sample surveys with any number of stages.
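The two-step method itself is not reproduced here, but the variance function is easy to experiment with: a Tweedie family with variance power 1.5 gives exactly Var(y) = φμ^1.5. A minimal sketch with simulated data and hypothetical sampling weights (an illustration, not the authors' implementation):

```python
# Quasi-likelihood fit with Var(y) = phi * mu**1.5 via a Tweedie family;
# survey weights enter as variance weights, and a sandwich ("robust")
# covariance stands in for proper design-based variance estimation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
mu = np.exp(1.0 + 0.5 * x)                           # log-linear mean for costs
y = rng.gamma(shape=np.sqrt(mu), scale=np.sqrt(mu))  # mean mu, variance mu**1.5
w = rng.uniform(1, 3, size=n)                        # hypothetical sampling weights

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Tweedie(var_power=1.5),
             var_weights=w).fit(cov_type="HC0")
print(fit.params, fit.bse)
```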

3.
于力超 (Yu Lichao), 金勇进 (Jin Yongjin). 《统计研究》 (Statistical Research), 2018, 35(11): 93-104
Large-scale sample surveys typically employ complex sampling designs, yielding survey data sets with a stratified, nested structure in which missing data are unavoidable; imputation strategies for hierarchical data sets with missing values have so far received little attention. This paper applies the Gibbs algorithm to multiple imputation for hierarchical data sets with missing values, studying both a fixed-effects imputation model and a random-effects imputation model. Through theoretical derivation and numerical simulation, the imputation performance of the two methods is compared in terms of the unbiasedness and efficiency of the resulting parameter estimates, under different intraclass correlation coefficients, group sizes, and missing-data proportions, and recommendations for choosing the imputation model are given. The results show that the random-effects imputation model yields more accurate parameter estimates, while the fixed-effects imputation model is simpler to implement: when the missing-data proportion is small, the intraclass correlation is large, or the group size is large, the fixed-effects imputation model may be used; otherwise the random-effects imputation model is recommended.
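A simplified sketch of the random-effects imputation model discussed above (one imputation only; the paper's Gibbs-based multiple imputation also draws the model parameters, which are fixed at their estimates here for brevity):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
g = np.repeat(np.arange(30), 20)                   # 30 groups of size 20
x = rng.normal(size=g.size)
y = 2 + 0.5 * x + rng.normal(0, 1, 30)[g] + rng.normal(size=g.size)
y[rng.random(g.size) < 0.2] = np.nan               # ~20% missing at random
df = pd.DataFrame({"y": y, "x": x, "g": g})

obs, mis = df.dropna(), df[df["y"].isna()]
fit = smf.mixedlm("y ~ x", obs, groups=obs["g"]).fit()
re = fit.random_effects                            # predicted group intercepts
# Assumes every group has at least one observed row.
pred = (fit.fe_params["Intercept"] + fit.fe_params["x"] * mis["x"]
        + mis["g"].map(lambda k: float(re[k].iloc[0])))
df.loc[mis.index, "y"] = pred + rng.normal(0, np.sqrt(fit.scale), len(mis))
```

A fixed-effects imputation model would replace the predicted random intercepts with per-group dummy coefficients, which is simpler to fit but estimates one parameter per group.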

4.
Multilevel latent class analysis can provide more effective information on both individual and group typologies, but its model selection issues still need further investigation. The present study examined the issue of class enumeration at the higher level for a relatively complex model using AIC, AIC3, BIC, and BIC*. A simulation study was conducted and its results were verified with empirical data. The results demonstrate that the accuracy of these criteria depends on sample size: the sample size per group plays an evident role in improving the accuracy of AIC3 and BIC, and the more complex model requires a larger sample size per group to ensure accurate class enumeration.
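For reference, the four criteria (standard definitions, with k free parameters and maximized likelihood L; BIC* is taken here to be the common group-level variant that replaces the total sample size n by the number of groups J, in line with the study's higher-level focus):

```latex
\mathrm{AIC} = -2\ln L + 2k, \qquad \mathrm{AIC3} = -2\ln L + 3k,\\
\mathrm{BIC} = -2\ln L + k\ln n, \qquad \mathrm{BIC}^{*} = -2\ln L + k\ln J .
```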

5.
Communications in Statistics - Theory and Methods, 2012, 41(16-17): 3278-3300
Under complex survey sampling, in particular when selection probabilities depend on the response variable (informative sampling), the sample and population distributions differ, possibly resulting in selection bias. This article addresses this problem by fitting two statistical models for one-way analysis of variance, namely the variance components model (a two-stage model) and the fixed effects model (a single-stage model), under complex survey designs involving, for example, two-stage sampling, stratification, and unequal selection probabilities. Classical theory underlying the use of the two-stage model assumes simple random sampling at each of the two stages; in such cases the model holding for the sample, after sample selection, is the same as the model holding for the population before sample selection. When the selection probabilities are related to the values of the response variable, standard estimates of the population model parameters may be severely biased, possibly leading to false inference. The idea behind the approach is to extract the model holding for the sample data as a function of the model in the population and of the first-order inclusion probabilities, and then to fit the sample model using analysis of variance, maximum likelihood, and pseudo-maximum-likelihood estimation. The main feature of the proposed techniques is their behavior in terms of the informativeness parameter. We also show that using the population model while ignoring the informative sampling design yields biased model fitting.
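The extraction step described above is commonly written through the relationship between the sample and population densities (Pfeffermann's sample-distribution identity; the notation is mine):

```latex
f_s(y \mid x) = \frac{E(\pi \mid y, x)\, f_p(y \mid x)}{E(\pi \mid x)} ,
```

where π is the first-order inclusion probability. When π is unrelated to y given x, the numerator expectation is free of y, the two densities coincide, and the design is non-informative.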

6.
One of the principal sources of error in data collected from structured face-to-face interviews is the interviewer. The other major component of imprecision in survey estimates is sampling variance. It is rare, however, to find studies in which the complex sampling variance and the complex interviewer variance are both computed. This paper compares the relative impact of interviewer effects and sample design effects on survey precision by making use of an interpenetrated primary sampling unit–interviewer experiment which was designed by the authors for implementation in the second wave of the British Household Panel Study as part of its scientific programme. It also illustrates the use of a multilevel (hierarchical) approach in which the interviewer and sample design effects are estimated simultaneously while being incorporated in a substantive model of interest.
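A minimal sketch of the kind of model involved (my notation, not necessarily the authors' specification): a response from subject i, interviewed by interviewer j within primary sampling unit k, is modelled with crossed random effects,

```latex
y_{i(jk)} = \mathbf{x}_{ijk}'\boldsymbol{\beta} + u_j + v_k + e_{ijk}, \qquad
u_j \sim N(0,\sigma_u^2),\; v_k \sim N(0,\sigma_v^2),\; e_{ijk} \sim N(0,\sigma_e^2),
```

so that the interviewer variance σ_u² and the PSU (sample design) variance σ_v² are estimated simultaneously and their relative magnitudes can be compared directly.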

7.
A folded type model is developed for analysing compositional data. The proposed model involves an extension of the α-transformation for compositional data and provides a new and flexible class of distributions for modelling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm guarantees efficient parameter estimation. The model is validated through simulation studies and examples, which illustrate that it captures the data structure better than the popular logistic normal distribution and can be advantageous over a similar model without folding.
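For orientation, the α-transformation being extended maps a D-part composition x to Euclidean space as (in the form given by Tsagris, Preston and Wood; the paper's folded extension is not reproduced here):

```latex
z_{\alpha}(\mathbf{x}) = \mathbf{H}\,\frac{1}{\alpha}\left(\frac{D\,\mathbf{x}^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} - \mathbf{1}_D\right),
```

where the power x^α is applied componentwise and H is the (D−1)×D Helmert sub-matrix; as α → 0 the transformation tends to the isometric log-ratio transformation.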

8.
Statistical estimation and hypothesis testing are two ways in which the data of a sample can be made to yield information about the parameters of the population from which the sample is drawn. Among the most important applications of these methods are the acceptance sampling schemes used in industrial quality control, systems reliability, and failure detection. Using these sampling methods carries the risk of unnecessarily rejecting satisfactory lots and the risk of accepting lots with defective units; such erroneous decisions can occur when a biased sample is selected that contains mostly defective units, or none at all, even though the composition of the lot is the opposite. The aim of this research is to make such decisions improbable or impossible events. The parameters and techniques determining the sampling methods must be correctly chosen, and the formulation of an optimized statistical model of these problems is the basic condition for obtaining objective results. Statistical simulation was used: the functioning of the complex system was represented by a mathematically formulated model, isomorphic to the real process in all aspects essential to the research objectives, and this model was repeatedly run to determine the required statistical characteristics of the underlying complex stochastic process.
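As a concrete illustration of the two risks (standard acceptance-sampling arithmetic, not taken from the paper): under a single-sampling plan that accepts a lot when a sample of n units contains at most c defectives, the acceptance probability is a binomial tail.

```python
# Operating-characteristic (OC) value of a single-sampling plan (n, c):
# accept the lot if the number of defectives in a sample of n is <= c.
from scipy.stats import binom

def accept_prob(p: float, n: int = 50, c: int = 2) -> float:
    """P(accept lot | true lot fraction defective p)."""
    return binom.cdf(c, n, p)

# Producer's risk: rejecting a good lot (p = 1%).
print(1 - accept_prob(0.01))   # about 0.014
# Consumer's risk: accepting a bad lot (p = 10%).
print(accept_prob(0.10))       # about 0.112
```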

9.
This paper identifies six important aspects of household surveys that are affected by sample rotation and, drawing on the experience of household surveys abroad, discusses problems in the relevant domestic systems and research. By constructing a variance-cost model suited to complex sample analysis and combining data simulation with comparative-static analysis, the effects of sample rotation are considered comprehensively across these six aspects, yielding a mechanism for determining an effective sample rotation rate and rotation frequency.
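One textbook ingredient of such a variance-cost model (an illustration of the trade-off, not the paper's specification): if a proportion λ of the sample is retained between two adjacent occasions and the unit-level correlation across occasions is ρ, then under simple assumptions

```latex
\operatorname{Var}(\bar{y}_t - \bar{y}_{t-1}) \approx \frac{2\sigma^2}{n}\,(1 - \lambda\rho),
```

so a lower rotation rate (higher overlap λ) sharpens estimates of change but slows the refreshment of the sample, which is the tension the rotation rate and frequency must balance.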

10.
Misclassifications in binary responses have long been a common problem in medical and health surveys. One way to handle misclassifications in clustered or longitudinal data is to incorporate the misclassification model through the generalized estimating equation (GEE) approach. However, existing methods are developed under a non-survey setting and cannot be used directly for complex survey data. We propose a pseudo-GEE method for the analysis of binary survey responses with misclassifications. We focus on cluster sampling and develop strategies for analyzing binary survey responses with different forms of additional information on the misclassification process. The proposed methodology has several attractive features, including simultaneous inferences for both the response model and the association parameters. Finite sample performance of the proposed estimators is evaluated through simulation studies and an application using a real dataset from the Canadian Longitudinal Study on Aging.
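The standard identity on which such misclassification adjustments are built (not specific to this paper): with sensitivity se and specificity sp for the observed response y* given the true response y,

```latex
P(y^{*}{=}1) = se\, P(y{=}1) + (1-sp)\,\bigl(1-P(y{=}1)\bigr)
\quad\Longrightarrow\quad
P(y{=}1) = \frac{P(y^{*}{=}1) - (1-sp)}{se + sp - 1},
```

and a pseudo-GEE replaces the naive mean model for y* with this corrected mean, with the survey weights entering the estimating equations.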

11.
Social data often contain missing information, and the problem is inevitably severe when analysing historical data. Conventionally, researchers analyse complete records only; such listwise deletion not only reduces the effective sample size but may also result in biased estimation, depending on the missingness mechanism. We analyse household types using population registers from ancient China (618–907 AD), comparing a simple classification, a latent class model of the complete data, and a latent class model of the complete and partially missing data under four types of ignorable and non-ignorable missingness mechanisms. The findings show that both a frequency classification and a latent class analysis using the complete records only yielded biased estimates and incorrect conclusions in the presence of partially missing data with a non-ignorable mechanism. Although simply assuming an ignorable or a non-ignorable mechanism produced similar, consistently higher estimates of the proportion of complex households, specifying the relationship between the latent variable and the degree of missingness through a row-effect uniform association model captured the missingness mechanism better and improved the model fit.
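The row-effect uniform association model mentioned at the end can be sketched in log-linear form (generic notation, not the paper's):

```latex
\log m_{cj} = \lambda + \lambda^{C}_{c} + \lambda^{M}_{j} + \phi_{c}\, v_j ,
```

where c indexes the latent classes, j the ordered levels of missingness with fixed scores v_j, and the row effects φ_c give each class its own linear trend across the missingness levels, which is what lets the model approximate a non-ignorable mechanism.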

12.
We propose a unified approach to the estimation of regression parameters under double-sampling designs, in which both a primary sample, consisting of data on rough or proxy measures of the response and/or explanatory variables, and a validation subsample, consisting of data on the exact measurements, are available. We assume that the validation sample is a simple random subsample of the primary sample. Our proposal utilizes a specific parametric model to extract the partial information contained in the primary sample. The resulting estimator is consistent even if that model is misspecified, and it achieves higher asymptotic efficiency than the estimator based only on the validation data. Specific cases are discussed to illustrate the application of the proposed estimator.

13.
This paper is mainly concerned with modelling data from degradation sample paths over time. It uses a general growth curve model with Box-Cox transformation, random effects and ARMA(p, q) dependence to analyse a set of such data. A maximum likelihood estimation procedure for the proposed model is derived and future values are predicted, based on the best linear unbiased prediction. The paper compares the proposed model with a nonlinear degradation model from a prediction point of view. Forecasts of failure times with various data lengths in the sample are also compared.
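For reference, the Box-Cox transformation applied to each positive degradation measurement y is

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log y, & \lambda = 0,
\end{cases}
```

after which the transformed paths follow the growth curve with random effects and ARMA(p, q) errors, with λ estimated jointly with the other parameters by maximum likelihood.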

14.
Statistical agencies make changes to the data collection methodology of their surveys to improve the quality of the data collected or to improve the efficiency with which they are collected. For reasons of cost it may not be possible to estimate the effect of such a change on survey estimates or response rates reliably, without conducting an experiment that is embedded in the survey which involves enumerating some respondents by using the new method and some under the existing method. Embedded experiments are often designed for repeated and overlapping surveys; however, previous methods use sample data from only one occasion. The paper focuses on estimating the effect of a methodological change on estimates in the case of repeated surveys with overlapping samples from several occasions. Efficient design of an embedded experiment that covers more than one time point is also mentioned. All inference is unbiased over an assumed measurement model, the experimental design and the complex sample design. Other benefits of the approach proposed include the following: it exploits the correlation between the samples on each occasion to improve estimates of treatment effects; treatment effects are allowed to vary over time; it is robust against incorrectly rejecting the null hypothesis of no treatment effect; it allows a wide set of alternative experimental designs. This paper applies the methodology proposed to the Australian Labour Force Survey to measure the effect of replacing pen-and-paper interviewing with computer-assisted interviewing. This application considered alternative experimental designs in terms of their statistical efficiency and their risks to maintaining a consistent series. The approach proposed is significantly more efficient than using only 1 month of sample data in estimation.

15.
An important problem in statistical practice is the selection of a suitable statistical model. Several model selection strategies are available in the literature, having different asymptotic and small sample properties, depending on the characteristics of the data generating mechanism. These characteristics are difficult to check in practice and there is a need for a data-driven adaptive procedure to identify an appropriate model selection strategy for the data at hand. We call such an identification a model metaselection, and we base it on the analysis of recursive prediction residuals obtained from each strategy with increasing sample sizes. Graphical tools are proposed in order to study these recursive residuals. Their use is illustrated on real and simulated data sets. When necessary, an automatic metaselection can be performed by simply accumulating predictive losses. Asymptotic and small sample results are presented.
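A schematic of the automatic version (the fit/predict interface here is hypothetical, and the paper's graphical tools are not sketched): every candidate strategy is refit on a growing window and charged its one-step-ahead squared prediction error.

```python
# Metaselection by accumulated recursive prediction losses.
import numpy as np

def metaselect(pairs, strategies, t0=20):
    """pairs: list of (x, y); strategies: name -> fit(train) returning a predictor.
    Returns the name of the strategy with the lowest accumulated loss."""
    losses = {name: 0.0 for name in strategies}
    for t in range(t0, len(pairs)):
        train, (x_t, y_t) = pairs[:t], pairs[t]
        for name, make_predictor in strategies.items():
            predictor = make_predictor(train)      # refit on data up to t
            losses[name] += (predictor(x_t) - y_t) ** 2
    return min(losses, key=losses.get)

def mean_strategy(train):                          # predicts the training mean
    m = np.mean([y for _, y in train])
    return lambda x: m

def ls_strategy(train):                            # simple least-squares line
    xs, ys = zip(*train)
    coef = np.polyfit(xs, ys, 1)
    return lambda x: np.polyval(coef, x)

# best = metaselect(data_pairs, {"mean": mean_strategy, "ls": ls_strategy})
```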

16.
Estimation of price indexes in the United States is generally based on complex rotating panel surveys. The sample for the Consumer Price Index, for example, is selected in three stages (geographic areas, establishments, and individual items), with 20% of the sample being replaced by rotation each year. At each period, a time series of data is available for use in estimation. This article examines how best to combine data for estimation of long-term and short-term changes, and how to estimate the variances of the index estimators in the context of two-stage sampling. I extend the class of estimators, introduced by Valliant and Miller, of Laspeyres indexes formed using sample data collected from the current period back to a previous base period. Linearization estimators of variance for indexes of long-term and short-term change are derived. The theory is supported by an empirical simulation study using two-stage sampling of establishments and items from a population derived from U.S. Bureau of Labor Statistics data.
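For reference, the Laspeyres index tracks price change from base period 0 to period t with quantities fixed at the base:

```latex
L_{0,t} = \frac{\sum_i p_{it}\, q_{i0}}{\sum_i p_{i0}\, q_{i0}}
        = \sum_i w_{i0}\,\frac{p_{it}}{p_{i0}},
\qquad w_{i0} = \frac{p_{i0}\, q_{i0}}{\sum_j p_{j0}\, q_{j0}},
```

so long-term change is a weighted mean of item price relatives, and short-term change can be read off as the ratio L_{0,t}/L_{0,t-1}.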

17.
Outliers in multilevel data
This paper offers the data analyst a range of practical procedures for dealing with outliers in multilevel data. It first develops several techniques for data exploration and outlier analysis, and then applies these to a detailed analysis of outliers in two large-scale multilevel data sets from educational contexts. The techniques include the use of deviance reduction, measures based on residuals, leverage values, hierarchical cluster analysis and a measure called DFITS. Outlier analysis is more complex in a multilevel data set than in, say, a univariate sample or a set of regression data, where the concept of an outlying value is straightforward. In the multilevel situation one has to consider, for example, at what level or levels a particular response is outlying, and in respect of which explanatory variables; furthermore, the treatment of a particular response at one level may affect its status or the status of other units at other levels in the model.
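Among the diagnostics listed, DFITS carries over from single-level regression, where it has the familiar form (the paper's multilevel versions generalise this):

```latex
\mathrm{DFITS}_i = r_i^{*} \sqrt{\frac{h_i}{1-h_i}} ,
```

with r_i* the externally studentised residual of unit i and h_i its leverage; large absolute values flag observations whose removal would noticeably shift their own fitted value.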

18.
The logistic regression model has been widely used in the social and natural sciences, and results from studies using this model can have significant policy impacts. Thus, confidence in the reliability of inferences drawn from these models is essential, and the robustness of such inferences depends on sample size. The purpose of this article is to examine the impact of alternative data sets on the mean estimated bias and efficiency of parameter estimation and inference for the logistic regression model with observational data. A number of simulations are conducted examining the impact of sample size, nonlinear predictors, and multicollinearity on substantive inferences (e.g. odds ratios, marginal effects) when using logistic regression models. Findings suggest that small sample size can negatively affect the quality of parameter estimates and inferences in the presence of rare events, multicollinearity, and nonlinear predictor functions, although marginal-effects estimates are relatively robust to sample size.
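An illustrative simulation in the spirit of the article (my design choices, not the author's): the mean bias of the logistic slope estimate under rare events shrinks as the sample grows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

def mean_bias(n, beta0=-3.0, beta1=1.0, reps=500):
    """Mean bias of the slope MLE; beta0 = -3 makes events rare (~6%)."""
    est = []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = rng.binomial(1, 1 / (1 + np.exp(-(beta0 + beta1 * x))))
        if y.sum() in (0, n):              # all-0 or all-1: slope unidentified
            continue
        try:
            fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
            est.append(fit.params[1])
        except Exception:                  # e.g. perfect separation
            continue
    return np.mean(est) - beta1

print(mean_bias(50), mean_bias(1000))      # small-n bias vs large-n
```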

19.
In this paper we present methods for inference on data selected by a complex sampling design for a class of statistical models for the analysis of ordinal variables. Specifically, assuming that the sampling scheme is not ignorable, we derive variance estimates for the class of CUB models (Combination of discrete Uniform and shifted Binomial distributions) under a complex two-stage stratified sample. Both Taylor linearization and repeated replication variance estimators are presented. We also provide design-based test diagnostics and goodness-of-fit measures. We illustrate, by means of real data analysis, the differences between survey-weighted and unweighted point estimates and inferences for CUB model parameters.
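For reference, a CUB response R on the ordinal scale 1, …, m mixes a shifted Binomial "feeling" component with a discrete Uniform "uncertainty" component:

```latex
P(R = r) = \pi \binom{m-1}{r-1} (1-\xi)^{r-1} \xi^{\,m-r} + (1-\pi)\,\frac{1}{m},
\qquad r = 1,\dots,m ,
```

and the survey-weighted and unweighted estimates compared in the paper target exactly these parameters (π, ξ).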

20.
Inflated data are prevalent in many situations, and a variety of inflated models with extensions have been derived to fit data with excessive counts of some particular responses. The family of information criteria (IC) has been used to compare the fit of models for selection purposes, yet despite their common use in statistical applications, few studies have evaluated the performance of IC in inflated models. In this study, we examined the performance of IC for dual-inflated data. The new zero- and K-inflated Poisson (ZKIP) regression model and conventional inflated models, including Poisson regression and zero-inflated Poisson (ZIP) regression, were fitted to dual-inflated data and the performance of the IC was compared. The effects of sample size and the proportion of inflated observations on selection performance were also examined. The results suggest that the Bayesian information criterion (BIC) and consistent Akaike information criterion (CAIC) are more accurate than the Akaike information criterion (AIC) in terms of model selection when the true model is simple (i.e. Poisson regression (POI)). For more complex models, such as ZIP and ZKIP, the AIC was consistently better than the BIC and CAIC, although it did not reach high levels of accuracy when the sample size and the proportion of zero observations were small. The AIC tended to over-fit the data for POI, whereas the BIC and CAIC tended to under-parameterize the data for ZIP and ZKIP. It is therefore desirable to study other model selection criteria for dual-inflated data with small sample sizes.
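A natural form for the ZKIP model studied above (my notation; the article's parameterisation may differ) adds point masses at 0 and at the inflated count K to a Poisson(λ) component:

```latex
P(Y = y) = \omega_0\, \mathbb{1}\{y=0\} + \omega_K\, \mathbb{1}\{y=K\}
         + (1-\omega_0-\omega_K)\, \frac{e^{-\lambda}\lambda^{y}}{y!} ,
```

so setting ω_0 = ω_K = 0 recovers the Poisson model and ω_K = 0 recovers ZIP, which is why the three models form a natural comparison set for the IC study.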
