Similar Documents
20 similar documents found.
1.
In the era of big data, integrating data from different sources is the first step of any data analysis. Record linkage is one of the core techniques of data integration and draws on both statistics and computer science. In the developed countries of Europe and North America, record linkage theory and practice go back several decades, but systematic research on the topic remains rare in China. This paper introduces the foundational statistical model of record linkage, the Fellegi-Sunter model, summarizes the practical application workflow, and presents application cases, in the hope of informing statistical work and big data applications in China.
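To make the Fellegi-Sunter model concrete, the minimal sketch below computes the log2 match weight of a record pair from field-level m- and u-probabilities. The field names and probability values are hypothetical; in practice they would be estimated, for example via the EM algorithm.

```python
import math

# Hypothetical m- and u-probabilities for three comparison fields.
M_PROBS = {"surname": 0.95, "birth_year": 0.90, "zip": 0.85}
U_PROBS = {"surname": 0.01, "birth_year": 0.05, "zip": 0.10}

def fs_weight(agreements):
    """Total log2 match weight for a record pair.

    agreements maps field name -> True if the pair agrees on that field.
    Agreement on a field contributes log2(m/u); disagreement
    contributes log2((1-m)/(1-u)).
    """
    w = 0.0
    for field, agree in agreements.items():
        m, u = M_PROBS[field], U_PROBS[field]
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

# A pair agreeing on surname and birth year but not on zip code:
print(fs_weight({"surname": True, "birth_year": True, "zip": False}))
```

Pairs with total weight above an upper threshold are classified as matches, those below a lower threshold as non-matches, and the rest are sent to clerical review.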

2.
许福娇 《浙江统计》2011,(12):54-55
The quality of statistical data is the lifeline of statistical work, and its meaning has broadened further. At present, imperfect statistical methods and systems, inconsistent indicators, and reliance on a single data source keep statistical data quality low. Administrative records can improve data quality in data acquisition, review and evaluation, and public credibility, but their development also faces problems such as inter-departmental coordination, data alignment, and data transfer. The paper recommends establishing sound legislation, building an integrated statistical system, developing compatible software, and strengthening data management.

3.
陆婷 《中国统计》2023,(1):71-74
In recent years, a new round of technological revolution led by big data has been advancing in depth. Steadily pushing forward the modernization of statistics and accelerating the construction of a modern statistical survey system embody the top-level design and systematic planning of the Party Central Committee, with Comrade Xi Jinping at its core, for the development of statistics in the new era. The 14th Five-Year Plan for statistical modernization reform calls for applying departmental administrative records in household sample surveys, strengthening the study and measurement of low- and middle-income groups, and exploring a statistical monitoring system for common prosperity. Using big data and departmental administrative records is both a new requirement placed on household survey work in the new era and a necessary path for the development and transformation of household surveys against this backdrop.

4.
In the existing statistical literature, the relationship between statistics and administrative records is only briefly mentioned, and no precise definition of administrative records has been given. We hold that administrative records are the written descriptions and accounts that officials make of things and their changes in the course of exercising administrative authority. Administrative records are generally kept in written form and also include descriptions of quantitative aspects of phenomena; private records become administrative records only once they are officially recognized. The descriptions of social and economic phenomena in local gazetteers are the most typical form of local administrative records. In eras without a statistical administration, administrative records were the principal form of quantitative description of social and economic phenomena.

5.
徐蔼婷  杨玉香 《统计研究》2015,32(11):88-96
Conducting a population census based on administrative records is regarded as one way to effectively resolve the difficulties of the traditional census, and is also a natural choice for fully exploiting administrative population records in the era of big data. This paper systematically describes the basic framework of register-based census methods and analyzes the implementation steps of the "full register" census and the "combined" census. On this basis, it compares four countries (Finland, Austria, Switzerland and the Netherlands) along six dimensions: basic census circumstances, choice of administrative record types and formation of base registers, the structure of existing statistical record systems, the design of specially organized sample surveys, approaches to linking different systems, and methods for assessing the quality of the new population statistics.

6.
In the existing statistical literature, no precise definition of administrative records has yet been given. The author holds that administrative records are the descriptions and accounts that officials make of things and their changes in the course of exercising administrative authority. The descriptions of social and economic phenomena in local gazetteers are the most typical form of local administrative records. In eras without a statistical administration, administrative records were the principal form of quantitative description of social and economic phenomena. With the continued development of the market economy, it has become necessary to make full use of administrative records as a supplementary source of statistical data. Statistics and administrative records are also closely related. First, broadly speaking, the products of statistical work are themselves administrative records generated in the course of statistical administration, and the data obtained through statistical administration are statistical administrative records; at the same time, administrative records are also a…

7.
金勇进  刘展 《统计研究》2016,33(3):11-17
When sampling from big data, constructing a sampling frame is often difficult, so the resulting samples are non-probability samples, and classical sampling inference theory cannot be applied directly. How to make statistical inference from non-probability samples is a serious challenge facing survey sampling in the big data era. This paper proposes a basic approach to the problem. First, sampling methods: sample selection based on sample matching, link-tracing sampling and similar methods can make the non-probability sample approximate a probability sample, so that probability-sample inference theory can be applied. Second, construction and adjustment of weights: base weights analogous to those of a probability sample can be obtained via pseudo-design-based, model-based and propensity-score methods. Third, estimation: pseudo-design-based, model-based and Bayesian composite probability estimation can be considered. Finally, sample selection based on sample matching is used as an example to illustrate a concrete solution.
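A minimal sketch of the propensity-score route to base weights mentioned in the abstract: the non-probability sample is stacked with a probability reference sample, sample membership is modeled, and inverse propensities serve as pseudo-weights. The data, model and weight formula below are purely illustrative; a real application would also fold in the reference sample's design weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_nonprob = rng.normal(0.5, 1.0, size=(500, 2))   # non-probability sample covariates
X_ref = rng.normal(0.0, 1.0, size=(1000, 2))      # probability reference sample

# Stack the two samples and flag non-probability membership (z = 1).
X = np.vstack([X_nonprob, X_ref])
z = np.concatenate([np.ones(500), np.zeros(1000)])

# Model the propensity of appearing in the non-probability sample.
model = LogisticRegression().fit(X, z)
p = model.predict_proba(X_nonprob)[:, 1]

# One common pseudo-weight choice: the inverse odds of membership.
pseudo_weights = (1 - p) / p
print(pseudo_weights[:5])
```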

8.
At present, with the continued development of the market economy, collecting statistical data through complete enumeration has become difficult, so administrative records are being used, alongside other methods, as a supplementary source of statistical data. How, then, do statistics and administrative records differ, and how are they related? They differ, first, in meaning: "statistics" means describing facts with numbers and currently comprises statistical work, statistical data and the discipline of statistics, whereas "administrative records" describe facts with both words and numbers. Second, they differ in data sources: statistical data come from information about society and the economy collected with statistical methods, which mainly include censuses, complete enumeration and sample surveys, as well as, following advances in science and technology, mathematical mod…

9.
As government and public attention to population data grows, the population census faces enormous challenges. How can enumeration errors and census costs be reduced? How can statistical data be made more reliable? A new census method, one based on record linkage…

10.
In socioeconomic statistics, statistical agencies increasingly recognize the importance of administrative records, and their development and use have become an active research topic. Drawing on the existing literature, this paper systematically reviews the origins of the relationship between administrative records and government statistics, the advantages of administrative records, issues to consider in their use, practices and trends in their use in government statistics at home and abroad, and directions for using administrative records to improve government statistics.

11.
Among the goals of statistical matching, a very important one is the estimation of the joint distribution of variables that are not jointly observed in a single sample survey but are separately available from independent sample surveys. The absence of joint information on the variables of interest leads to uncertainty about the data-generating model, since the available sample information cannot discriminate among a set of plausible joint distributions. The present paper briefly reviews the concept of uncertainty in statistical matching under logical constraints and how to measure it for continuous variables. The notion of matching error is related to an appropriate measure of uncertainty, and a criterion for selecting matching variables is introduced: choose the variables that minimize this uncertainty measure. Finally, a method for choosing a plausible joint distribution of the variables of interest via the iterative proportional fitting algorithm is described. The proposed methodology is then applied to household income and expenditure data when extra sample information on the average propensity to consume is available. This yields a reconstructed complete dataset in which each record includes measures of both income and expenditure.
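The iterative proportional fitting step referenced in the abstract can be sketched in a few lines: a seed contingency table for income classes (rows) by expenditure classes (columns) is rescaled until its margins match targets known from the two separate surveys. The seed and targets below are invented for illustration.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Iterative proportional fitting: rescale a seed table until its
    row and column margins match the given targets."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]  # fit rows
        table *= col_targets / table.sum(axis=0)             # fit columns
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            return table
    return table

# Hypothetical 2x2 seed with margins taken from the two surveys.
seed = np.array([[1.0, 1.0], [1.0, 1.0]])
print(ipf(seed,
          row_targets=np.array([60.0, 40.0]),
          col_targets=np.array([70.0, 30.0])))
```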

12.
Probabilistic matching of records is widely used to create linked data sets for use in health science, epidemiological, economic, demographic and sociological research. Clearly, this type of matching can lead to linkage errors, which in turn can lead to bias and increased variability when standard statistical estimation techniques are used with the linked data. In this paper we develop unbiased regression parameter estimates to be used when fitting a linear model with nested errors to probabilistically linked data. Since estimation of variance components is typically an important objective when fitting such a model, we also develop appropriate modifications to standard methods of variance components estimation in order to account for linkage error. In particular, we focus on three widely used methods of variance components estimation: analysis of variance, maximum likelihood and restricted maximum likelihood. Simulation results show that our estimators perform reasonably well when compared to standard estimation methods that ignore linkage errors.
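A toy simulation of the phenomenon this paper addresses: under exchangeable linkage errors, naive OLS on probabilistically linked data attenuates the slope by roughly the correct-link rate, which suggests a simple first-order correction. This is not the paper's estimator; the correct-link rate lam is assumed known here, which it would not be in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 10_000, 0.9            # lam: probability a link is correct (assumed)
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Permute the y values of a random (1 - lam) share of records
# to mimic false links between the two files.
idx = rng.choice(n, size=int((1 - lam) * n), replace=False)
y_linked = y.copy()
y_linked[idx] = y_linked[rng.permutation(idx)]

beta_naive = np.polyfit(x, y_linked, 1)[0]
# Under exchangeable errors E[beta_naive] is roughly lam * beta,
# so dividing by lam gives a crude bias correction.
print(f"naive slope: {beta_naive:.3f}, "
      f"corrected: {beta_naive / lam:.3f} (true slope 2.0)")
```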

13.
There are now three essentially separate literatures on the topics of multiple systems estimation, record linkage, and missing data. But in practice the three are intimately intertwined. For example, record linkage involving multiple data sources for human populations is often carried out with the expressed goal of developing a merged database for multiple system estimation (MSE). Similarly, one way to view both the record linkage and MSE problems is as ones involving the estimation of missing data. This presentation highlights the technical nature of these interrelationships and provides a preliminary effort at their integration.

14.
The widely used Fellegi–Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi–Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi–Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method. Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields, with diminished benefit as the discriminating power of matching fields increases.
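A hedged sketch of the frequency-based idea: if chance agreement on a value among non-matches is roughly that value's relative frequency, then agreement on a rare value earns a larger log2 weight than agreement on a common one. The frequencies and the m-probability below are invented, and this heuristic does not reproduce the paper's actual two-step estimation procedure.

```python
import math
from collections import Counter

# Hypothetical surname frequencies in the files being linked.
surnames = ["Smith"] * 500 + ["Nguyen"] * 50 + ["Zelenka"] * 2
freq = Counter(surnames)
total = sum(freq.values())

M = 0.95  # hypothetical field-level m-probability

def value_specific_weight(value):
    """Agreement weight for a specific value: chance agreement among
    non-matches is approximated by the value's relative frequency p_v,
    so the weight is log2(m / p_v) rather than a single log2(m / u)."""
    p_v = freq[value] / total
    return math.log2(M / p_v)

for name in ["Smith", "Nguyen", "Zelenka"]:
    print(name, round(value_specific_weight(name), 2))
```

Running this shows agreement on the rare surname "Zelenka" contributing far more weight than agreement on "Smith", which is the behavior the paper seeks.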

15.
Patterns of consent: evidence from a general household survey
We analyse patterns of consent and consent bias in the context of a large general household survey, the 'Improving survey measurement of income and employment' survey, also addressing issues that arise when there are multiple consent questions. A multivariate probit regression model for four binary outcomes with two incidental truncations is used. We show that there are biases in consent to data linkage with benefit and tax credit administrative records that are held by the Department for Work and Pensions, and with wage and employment data held by employers. There are also biases in respondents' willingness and ability to supply their national insurance number. The biases differ according to the question that is considered. We also show that modelling questions on consent independently rather than jointly may lead to misleading inferences about consent bias. A positive correlation between unobservable individual factors affecting consent to Department for Work and Pensions record linkage and consent to employer record linkage is suggestive of a latent individual consent propensity.

16.
The performance of Statistical Disclosure Control (SDC) methods for microdata (also called masking methods) is measured in terms of the utility and the disclosure risk associated to the protected microdata set. Empirical disclosure risk assessment based on record linkage stands out as a realistic and practical disclosure risk assessment methodology which is applicable to every conceivable masking method. The intruder is assumed to know an external data set, whose records are to be linked to those in the protected data set; the percent of correctly linked record pairs is a measure of disclosure risk. This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage (and thus disclosure) is still possible without shared variables.
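A minimal distance-based record linkage attack illustrating the empirical risk assessment described above: each record in the intruder's external file is linked to its nearest record in the masked file, and the share of correct links estimates the re-identification risk. Toy Gaussian data and additive-noise masking stand in for real microdata and a real SDC method.

```python
import numpy as np

rng = np.random.default_rng(2)
original = rng.normal(size=(200, 3))                    # intruder's external file
masked = original + rng.normal(scale=0.5,
                               size=original.shape)     # noise-masked release

# Pairwise Euclidean distances: row i compares external record i
# against every masked record.
d = np.linalg.norm(original[:, None, :] - masked[None, :, :], axis=2)
linked = d.argmin(axis=1)                               # nearest-neighbour links

# Records are aligned by construction, so a correct link is linked[i] == i.
risk = np.mean(linked == np.arange(len(original)))
print(f"re-identification rate: {risk:.2%}")
```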

17.
The probabilistic uncertainty in record linkage affects statistical analyses such as regression with linked data. This paper considers Bayesian regression analysis with linked data and shows that, even under the usual normal regression model, least-squares-type estimators of the regression coefficients are not always adequate. A method is proposed that uses the distribution of the response variable; it is related to finite mixture analysis and leads to more accurate estimation. A simple approach is also proposed to increase tractability and reduce the number of mixture components. A Monte Carlo simulation study assesses the proposed approach.

18.
A note on using the F-measure for evaluating record linkage algorithms
Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques (including supervised, unsupervised, semi-supervised and active learning based) have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the researcher or user, but not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw.
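The reformulation highlighted in the abstract is easy to verify numerically: F1 equals both the harmonic mean of precision and recall and a weighted arithmetic mean whose weight depends on the classifier's own output counts. The confusion-matrix counts below are invented for the check.

```python
def f_measure(tp, fp, fn):
    """Return the F-measure computed two ways: as the harmonic mean of
    precision and recall, and as a weighted arithmetic mean where the
    weight w = (tp + fp) / (2*tp + fp + fn) depends on how many pairs
    the linkage method itself declares to be matches."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    harmonic = 2 * precision * recall / (precision + recall)

    w = (tp + fp) / (2 * tp + fp + fn)      # weight on precision
    weighted = w * precision + (1 - w) * recall
    return harmonic, weighted

# Both forms agree (about 0.727 each for these counts).
print(f_measure(tp=80, fp=20, fn=40))
```

Because w shifts with the method's match/non-match decisions, two linkage methods are effectively scored with different precision/recall trade-offs, which is the conceptual weakness the paper identifies.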

19.
Estimation of the parameters of an exponential distribution based on record data has been treated by Samaniego and Whitaker [On estimating population characteristics from record-breaking observations, I. Parametric results, Naval Res. Logist. Q. 33 (1986), pp. 531–543] and Doostparast [A note on estimation based on record data, Metrika 69 (2009), pp. 69–80]. Recently, Doostparast and Balakrishnan [Optimal record-based statistical procedures for the two-parameter exponential distribution, J. Statist. Comput. Simul. 81(12) (2011), pp. 2003–2019] obtained optimal confidence intervals as well as uniformly most powerful tests for one- and two-sided hypotheses concerning location and scale parameters based on record data from a two-parameter exponential model. In this paper, we derive optimal statistical procedures including point and interval estimation as well as most powerful tests based on record data from a two-parameter Pareto model. For illustrative purposes, a data set on annual wages of a sample of production-line workers in a large industrial firm is analysed using the proposed procedures.
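This abstract and the next both concern inference from record data. As a small helper for readers unfamiliar with the setting, the sketch below extracts the upper record values from a sequence of observations; the estimation procedures of the papers themselves are not reproduced here.

```python
def upper_records(xs):
    """Return the upper record values of a sequence: the first
    observation, plus every observation strictly exceeding all
    observations before it."""
    records = []
    for x in xs:
        if not records or x > records[-1]:
            records.append(x)
    return records

print(upper_records([3.1, 2.4, 5.9, 5.2, 7.8, 7.8, 9.0]))
# -> [3.1, 5.9, 7.8, 9.0]
```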

20.
In this paper we address estimation and prediction problems for extreme value distributions under the assumption that the only available data are the record values. We provide some properties and pivotal quantities, and derive unbiased estimators for the location and rate parameters based on them. In addition, we discuss the mean-squared errors of the proposed estimators and exact confidence intervals for the rate parameter. For Bayesian inference, we develop an objective Bayesian analysis by deriving noninformative priors, namely the Jeffreys, reference, and probability-matching priors, for the location and rate parameters. We examine the validity of the proposed methods through Monte Carlo simulations for record samples of various sizes, and present a real data set for illustration purposes.

