Similar Documents
20 similar documents found (search time: 125 ms).
1.
黄恒君 《统计研究》2019,36(7):3-12
Big data holds great potential for statistical production and can help build a high-quality statistical production system, but the characteristics of data sources that meet statistical production goals, and the associated data quality problems, remain to be clarified. Starting from the common ground between big data sources and traditional statistical data sources, this paper discusses big data sources in statistical production and their data quality problems, and then explores the integration of big data into traditional statistical production. It first delimits, in terms of both the data generation process and data characteristics, the big data sources usable for statistical production; it then discusses data quality problems in big data statistical production within a generalized data quality framework, identifying the key quality control points and quality defects along the statistical production process; finally, based on the data quality analysis, it proposes an approach to building a statistical system that integrates big data into traditional surveys.

2.
Summary.  The process of quality control of micrometeorological and carbon dioxide (CO2) flux data can be subjective and may lack repeatability, which would undermine the results of many studies. Multivariate statistical methods and time series analysis were used together and independently to detect and replace outliers in CO2 flux data derived from a Bowen ratio energy balance system. The results were compared with those produced by five experts who applied the current and potentially subjective protocol. All protocols were tested on the same set of three 5-day periods, when measurements were conducted in an abandoned agricultural field. The concordance of the protocols was evaluated by using the experts' opinion (mean ± 1.96 standard deviations) as a reference interval (the Bland–Altman method). Analysing the 15 days together, the statistical protocol that combined multivariate distance, multiple linear regression and time series analysis showed a concordance of 93% on a 20-min flux basis and 87% on a daily basis (only 2 days fell outside the reference interval), and the overall flux differed only by 1.7% (3.2 g CO2 m−2). An automated version of this or a similar statistical protocol could be used as a standard way of filling gaps and processing data from Bowen ratio energy balance and other techniques (e.g. eddy covariance). This would enforce objectivity in comparisons of CO2 flux data that are generated by different research groups and streamline the protocols for quality control.
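A minimal sketch of the kind of combined outlier protocol this abstract describes (multivariate distance plus multiple linear regression), assuming NumPy/SciPy; the variable layout, the 97.5% cutoffs, and the gap-filling rule are illustrative assumptions, not the authors' implementation:

```python
# Sketch: flag and gap-fill 20-min CO2 fluxes by combining a multivariate
# distance check on the covariates with a regression-residual check.
import numpy as np
from scipy import stats

def qc_flux(X, y, alpha=0.025):
    """X: (n, p) micrometeorological covariates; y: (n,) CO2 fluxes."""
    n, p = X.shape
    # 1) Multivariate distance: flag records far from the covariate cloud.
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared Mahalanobis
    far = d2 > stats.chi2.ppf(1 - alpha, df=p)

    # 2) Multiple linear regression: flag large standardized residuals.
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    resid = y - fitted
    outlier = far | (np.abs(resid) > stats.norm.ppf(1 - alpha) * resid.std())

    # 3) Replace flagged fluxes with the regression fit (gap filling).
    return np.where(outlier, fitted, y), outlier
```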

3.
胡英 《统计研究》2018,35(4):94-103
China's current system of population statistical surveys "takes regular population sample surveys as the mainstay, the population census as the foundation, and key-point surveys and other methods as supplements". With rapid economic and social development, however, this census-based, sample-survey-centered system no longer meets the multi-level, fine-grained, and timely demand of government and society for population information, and contradictions and problems have emerged in practice. This paper analyzes the role and the problems of the current population census and the annual population change sample survey in population statistics, and on that basis proposes reform ideas together with concrete measures: taking the 2020 Seventh National Population Census as an opportunity, build a Population Statistics and Management Service Data Platform (《人口统计与管理服务数据平台》) and update it annually in combination with community grid management, so as to obtain annual resident population figures at the national, provincial, and sub-provincial levels; at the same time, reform the content of the population sample survey and apply mobile-phone signaling big data to population statistics, thereby improving the population statistical survey methodology.

4.
胡帆 《统计研究》2010,27(11):53-56
Drawing on the concept of a total quality management (TQM) system, this paper analyzes the elements of statistical survey data quality management that run through the entire statistical workflow and the roles they play. It focuses on the TQM process and the arrangement of key tasks and, in connection with the development of statistical informatization, discusses in particular the role of work standards and application software, as well as the construction and use of data resources.

5.
Summary.  As a special case of statistical learning, ensemble methods are well suited for the analysis of opportunistically collected data that involve many weak and sometimes specialized predictors, especially when subject-matter knowledge favours inductive approaches. We analyse data on the incidental mortality of dolphins in the purse-seine fishery for tuna in the eastern Pacific Ocean. The goal is to identify those rare purse-seine sets for which incidental mortality would be expected but none was reported. The ensemble method random forests is used to classify sets according to whether mortality was (response 1) or was not (response 0) reported. To identify questionable reporting practice, we construct 'residuals' as the difference between the categorical response (0,1) and the proportion of trees in the forest that classify a given set as having mortality. Two uses of these residuals to identify suspicious data are illustrated. This approach shows promise as a means of identifying suspect data gathered for environmental monitoring.
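A brief sketch of the residual construction described here, assuming scikit-learn; the feature matrix and the 0.5 vote threshold for flagging are illustrative choices, not the authors' code:

```python
# Sketch: 'residual' = observed label (0/1) minus the fraction of trees
# voting for mortality; sets reported as 0 but with most trees voting 1
# are candidates for unreported mortality.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_residuals(X, y, n_trees=500, seed=0):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    # Proportion of trees classifying each set as class 1 (mortality).
    votes = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)
    resid = y - votes
    suspect = (y == 0) & (votes > 0.5)   # illustrative flagging rule
    return resid, suspect
```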

6.
Every day we face all kinds of risks, and insurance is in the business of providing us with a means to transfer or share these risks, usually to eliminate or reduce the resulting financial burden, in exchange for a predetermined price or tariff. Actuaries are considered professional experts in the economic assessment of uncertain events, and equipped with many statistical tools for analytics, they help formulate a fair and reasonable tariff associated with these risks. An important part of the process of establishing fair insurance tariffs is risk classification, which involves the grouping of risks into various classes that share a homogeneous set of characteristics, allowing the actuary to price discriminate reasonably. This article is a survey paper on the statistical tools for risk classification used in insurance. Because of the recent availability of more complex data in the industry, together with the technology to analyze these data, we additionally discuss modern techniques that have recently emerged in the statistics discipline and can be used for risk classification. While several of the illustrations discussed in the paper focus on general, or non-life, insurance, many of the principles we examine can be similarly applied to life insurance. Furthermore, we also distinguish between a priori and a posteriori ratemaking. The former is the process that forms the basis for ratemaking when a policyholder is new and insufficient information may be available. The latter process uses additional historical information about policyholder claims when this becomes available. In effect, the resulting a posteriori premium allows one to correct and adjust the previous a priori premium, making the price discrimination even fairer and more reasonable.
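A hedged sketch of the a priori / a posteriori distinction, assuming statsmodels; the toy portfolio, the rating factors, and the credibility constant k are illustrative assumptions rather than a standard tariff:

```python
# A priori: a Poisson GLM frequency model on observable rating factors.
# A posteriori: blend the GLM rate with the insured's own claim history
# via a Buhlmann-style credibility weight z = n / (n + k).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

policies = pd.DataFrame({
    "claims":   [0, 2, 1, 0, 3, 1],
    "exposure": [1.0, 1.0, 0.5, 1.0, 1.0, 0.8],
    "age_band": ["18-25", "26-40", "26-40", "41-65", "18-25", "41-65"],
    "region":   ["urban", "urban", "rural", "rural", "urban", "rural"],
})

glm = smf.glm("claims ~ age_band + region", data=policies,
              family=sm.families.Poisson(),
              exposure=policies["exposure"]).fit()
prior_freq = glm.predict(policies, exposure=policies["exposure"]) \
             / policies["exposure"]

k = 2.0                                        # assumed structural constant
z = policies["exposure"] / (policies["exposure"] + k)
own_freq = policies["claims"] / policies["exposure"]
posterior_freq = z * own_freq + (1 - z) * prior_freq   # credibility blend
```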

7.
ABSTRACT

In the 1990s, statisticians began thinking in a principled way about how computation could better support the learning and doing of statistics. Since then, the pace of software development has accelerated, advancements in computing and data science have moved the goalposts, and it is time to reassess. Software continues to be developed to help do and learn statistics, but there is little critical evaluation of the resulting tools, and no accepted framework with which to critique them. This article presents a set of attributes necessary for a modern statistical computing tool. The framework was designed to be broadly applicable to both novice and expert users, with a particular focus on making statistical computing environments more supportive. A modern statistical computing tool should be accessible, provide easy entry, privilege data as a first-order object, support exploratory and confirmatory analysis, allow for flexible plot creation, support randomization, be interactive, include inherent documentation, support narrative, publishing, and reproducibility, and be flexible to extensions. Ideally, all these attributes could be incorporated into one tool, supporting users at all levels, but a more reasonable goal is for tools designed for novices and professionals to “reach across the gap,” taking inspiration from each other's strengths.

8.
李金昌 《统计研究》2020,37(2):119-128
Big data exists as an important data resource: both the information value it embodies and the fact that it has become an organic part of the data human society needs compel us to keep strengthening its application. Yet because big data is a by-product of information technology applications, its complexity, uncertainty, and emergence mean that applying it is no easy matter. Many quality problems arise, including not only all the quality issues of traditional data but also some distinctive new ones. To support better applications of big data, this paper presents a preliminary study of quality control in big data applications, covering three aspects. First, it examines what big data quality is, what factors affect it, and what quality problems may exist. Second, it proposes basic ideas for quality control in big data applications from six angles: making sound theoretical preparations, establishing quality control schemes, attending to small-data research, strengthening big data management, strengthening the training of big data professionals, and strengthening big data legislation. Third, it discusses, with examples, several aspects of big data applications that call for attention.

9.
Measuring users' subjective perception of statistical data quality through user satisfaction surveys provides an important information channel for the assessment and control of statistical data quality. Based on an analysis of the social background of, and the need for, user satisfaction surveys by government statistical agencies, and drawing on the practical experience of the European Statistical System's user satisfaction surveys, an early and representative international practice in this field, this paper discusses the practical essentials of implementing such surveys in China's government statistical agencies, covering institutional safeguards, goal positioning and content design, and survey organization and implementation.

10.
李金昌 《统计研究》2014,31(1):10-15
Recently, several books such as 《大数据时代》 (Big Data) have attracted wide attention. Big data is changing how people act and think; how, then, should statistics, whose object of study is data, respond? Based on an understanding of big data, this paper argues that statistical thinking needs to change in three respects: how we understand data, how we collect data, and how we analyze data. Within the thinking on data analysis, changes are needed in the statistical analysis process, the approach to empirical analysis, and the logic of inferential analysis, and the criteria for evaluating statistical analyses also need adjustment. Around these changes, the paper proposes eight ways in which statistics should respond actively to big data, so that the discipline keeps pace with the times.

11.
Most approaches to applying knowledge-based techniques in data analysis concentrate on context-independent statistical support. EXPLORA, however, was developed for subject-specific interpretation of the contents of the data to be analyzed (content interpretation). Its knowledge base therefore also includes the objects and semantic relations of the real system that produces the data. In this paper we describe the functional model representing the process of content interpretation, summarize the software architecture of the system, and give some examples of its application by pilot users in survey analysis. EXPLORA addresses applications with regularly produced data that have to be analyzed in a routine way. The system systematically searches for statistical results (facts) to detect relations that could be overlooked by a human analyst. On the other hand, EXPLORA helps overcome the bulk of information that is usually still produced when presenting the data. A second knowledge process of content interpretation therefore consists in discovering messages about the data by condensing the facts. Approaches to inductive generalization developed for machine learning are used to identify common attribute values of the objects to which the facts relate. At a later stage the system searches for interesting facts by applying redundancy rules and domain-dependent selection rules. EXPLORA formulates the messages in terms of the domain, groups and orders them, and even provides flexible navigation of the fact spaces.

12.
米子川  姜天英 《统计研究》2016,33(11):11-18
In July 2014, ANZ Bank (澳盛银行) for the first time included the Alibaba index series among the indicators it watches for inflation, marking the moment when big data indices began to question and challenge traditional survey-based indices. Based on a comparative study of Alibaba's aSPI and the officially published CPI, this paper first identifies some basic respects in which the aSPI is significantly superior to the CPI. It then compares the two indices empirically in terms of synchronicity and decomposability: a cointegration test establishes their synchronicity, and an EMD (empirical mode decomposition) model decomposes each series into fluctuation components and a growth trend; finally, building on the EMD decomposition of the aSPI, the CPI is estimated by Lasso regression. The study suggests that, as big data research broadens, becomes more scientific, and advances in methodology and software tools, corroboration, supplementation, and eventually fusion of big data indices with traditional statistical surveys will become a new trend; gradually developing new CPI compilation methods and analytical systems through empirical work, application, and development is the fundamental way forward for big data index theory and practice.
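A sketch of the three-step comparison described above, assuming the PyEMD package alongside statsmodels and scikit-learn; `aspi` and `cpi` are hypothetical aligned monthly series, not the paper's data:

```python
# (1) Engle-Granger cointegration test for synchronicity, (2) EMD to split
# each index into fluctuation components and a trend, (3) Lasso regression
# of CPI on the aSPI components.
from PyEMD import EMD                      # assumes the PyEMD package
from sklearn.linear_model import LassoCV
from statsmodels.tsa.stattools import coint

def compare_indices(aspi, cpi):
    """aspi, cpi: aligned 1-d NumPy arrays of monthly index values."""
    # (1) Synchronicity: a small p-value indicates a shared long-run trend.
    _, pvalue, _ = coint(aspi, cpi)
    # (2) Decomposition into IMFs; the final, slowest component plays the
    # role of the growth trend, the others are fluctuation components.
    components = EMD()(aspi)
    # (3) Estimate CPI from the aSPI components, with the L1 penalty
    # chosen by cross-validation.
    lasso = LassoCV(cv=5).fit(components.T, cpi)
    return pvalue, components, lasso.coef_
```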

13.
Summary.  Family Resources Survey (FRS) data for April 1997 to March 2000 are used to estimate the take-up of income support (IS) by a subset of pensioners. We scrutinize the quality of FRS data for this purpose and describe a process of identifying and correcting inconsistencies in the data. Comparisons are made, before and after corrections to the data, of take-up estimates, logistic regression take-up models and predictions of take-up responses to changes in IS rates. Overall, the corrections do not have large effects on estimated take-up rates but suggest that non-take-up is marginally less serious than the uncorrected data imply. Logistic regressions using corrected and uncorrected data were in broad agreement on the factors influencing take-up. There were some differences in the scale of these influences, with implications for predictions of take-up responses to changes in the generosity of IS. Desirable improvements in the FRS are identified.
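A sketch of a take-up model of the kind fitted here, assuming statsmodels; the data are simulated stand-ins for the FRS variables, and the covariates and effect sizes are illustrative:

```python
# Logistic regression of a 0/1 take-up indicator on the entitlement amount
# and household traits, then the predicted take-up response to a 10% more
# generous IS rate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
pensioners = pd.DataFrame({
    "entitlement": rng.gamma(2.0, 20.0, n),     # weekly IS entitlement
    "age": rng.integers(60, 95, n),
    "lives_alone": rng.integers(0, 2, n),
})
eta = -2 + 0.03 * pensioners["entitlement"] + 0.5 * pensioners["lives_alone"]
pensioners["took_up"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

model = smf.logit("took_up ~ entitlement + age + lives_alone",
                  data=pensioners).fit()
# Average predicted change in take-up if entitlements rise by 10%.
more_generous = pensioners.assign(entitlement=pensioners["entitlement"] * 1.1)
print(model.predict(more_generous).mean() - model.predict(pensioners).mean())
```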

14.
数据科学的统计学内涵 (The Statistical Connotation of Data Science)
Data science takes big data as its object of study, and big data's most direct impact on statistical analysis is the transformation of data collection methods; at the same time, the scope of statistical analysis is no longer confined to traditional attribute data but extends to richer data types such as relational data and unstructured and semi-structured data. With the open data movement, the value of the linkage information between databases is gradually being realized. From a statistical perspective, this paper studies the statistical connotation of data science along three dimensions: scientific and theoretical foundations, computer processing technology, and commercial applications. It discusses the direct influence of the data science paradigm on the statistical analysis process, as well as the opportunities and challenges facing the statistical perspective.

15.
ABSTRACT

Scientific research of all kinds should be guided by statistical thinking: in the design and conduct of the study, in the disciplined exploration and enlightened display of the data, and to avoid statistical pitfalls in the interpretation of the results. However, formal, probability-based statistical inference should play no role in most scientific research, which is inherently exploratory, requiring flexible methods of analysis that inherently risk overfitting. The nature of exploratory work is that data are used to help guide model choice, and under these circumstances, uncertainty cannot be precisely quantified, because of the inevitable model selection bias that results. To be valid, statistical inference should be restricted to situations where the study design and analysis plan are specified prior to data collection. Exploratory data analysis provides the flexibility needed for most other situations, including statistical methods that are regularized, robust, or nonparametric. Of course, no individual statistical analysis should be considered sufficient to establish scientific validity: research requires many sets of data along many lines of evidence, with a watchfulness for systematic error. Replicating and predicting findings in new data and new settings is a stronger way of validating claims than blessing results from an isolated study with statistical inferences.

16.
Expert opinion plays an important role when selecting promising clusters of chemical compounds in the drug discovery process. Indeed, experts can qualitatively assess the potential of each cluster, and with appropriate statistical methods, these qualitative assessments can be quantified into a success probability for each of them. However, one crucial element often overlooked is the procedure by which the clusters are assigned to/selected by the experts for evaluation. In the present work, the impact such a procedure may have on the statistical analysis and the entire evaluation process is studied. It has been shown that some implementations of the selection procedure may seriously compromise the validity of the evaluation even when the rating and selection processes are independent. Consequently, the fully random allocation of the clusters to the experts is strongly advocated. Copyright © 2014 John Wiley & Sons, Ltd.
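A minimal sketch of the fully random allocation advocated here, assuming NumPy; all counts are illustrative:

```python
# Which expert rates which cluster is decided by uniform random draws, so
# the assignment carries no information about a cluster's promise.
import numpy as np

def random_allocation(n_clusters=60, n_experts=5, raters_per_cluster=2,
                      seed=42):
    rng = np.random.default_rng(seed)
    # For each cluster, draw the required number of distinct experts.
    return {c: rng.choice(n_experts, size=raters_per_cluster, replace=False)
            for c in range(n_clusters)}

assignment = random_allocation()   # cluster id -> experts who rate it
```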

17.
Summary.  To obtain information about the contribution of individual and area level factors to population health, it is desirable to use both data collected on areas, such as censuses, and on individuals, e.g. survey and cohort data. Recently developed models allow us to carry out simultaneous regressions on related data at the individual and aggregate levels. These can reduce 'ecological bias' that is caused by confounding, model misspecification or lack of information and increase power compared with analysing the data sets singly. We use these methods in an application investigating individual and area level sociodemographic predictors of the risk of hospital admissions for heart and circulatory disease in London. We discuss the practical issues that are encountered in this kind of data synthesis and demonstrate that this modelling framework is sufficiently flexible to incorporate a wide range of sources of data and to answer substantive questions. Our analysis shows that the variations that are observed are mainly attributable to individual level factors rather than the contextual effect of deprivation.
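A sketch of a simplified joint individual/area analysis, assuming statsmodels; this is a linear-probability simplification with simulated data and hypothetical column names, not the paper's hierarchical related-regressions model:

```python
# Individual admissions regressed on an individual covariate (age) and an
# area-level deprivation score, with a random intercept per area to absorb
# residual contextual variation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n, n_areas = 2000, 40
people = pd.DataFrame({
    "area": rng.integers(0, n_areas, n),
    "age": rng.integers(18, 90, n),
})
deprivation = rng.normal(0.0, 1.0, n_areas)      # area-level score
people["deprivation"] = deprivation[people["area"]]
p = 1 / (1 + np.exp(-(-3 + 0.03 * people["age"] + 0.3 * people["deprivation"])))
people["admitted"] = rng.binomial(1, p)

fit = smf.mixedlm("admitted ~ age + deprivation", data=people,
                  groups=people["area"]).fit()
print(fit.summary())
```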

18.
In recent years the focus of research in survey sampling has changed to include a number of nontraditional topics such as nonsampling errors. In addition, the availability of data from large-scale sample surveys, along with computers and software to analyze the data, has changed the tools needed by survey sampling statisticians. It has also resulted in a diverse group of secondary data users who wish to learn how to analyze data from a complex survey. Thus it is time to reassess what we should be teaching students about survey sampling. This article brings together a panel of experts on survey sampling and teaching to discuss their views on what should be taught in survey sampling classes and how it should be taught.

19.
Data collection processes in official statistics are currently facing many pressures. Respondents ask for a reduced response burden. Users ask for more detailed information in a growing number of subject-matter areas. Governments restrict the budget, thus at the same time restricting the possible use of data collection methods. And the European Statistical System requires further harmonisation of the data collection methods used within the member states. These pressures are also reflected in fundamental changes in the way official statistics collects data. Administrative registers have replaced traditional censuses in many areas (such as the population census or the whole field of business statistics). Voluntary surveys more and more replace surveys with mandatory response. In some fields the logic of the response process is being radically reshaped: for example, in German business surveys, enterprises can now report their data directly from their Enterprise Resource Planning (ERP) systems. And in many areas increasingly complex sampling designs are used in order to enhance the efficiency of the fieldwork and to counteract a growing reluctance to participate in surveys. The paper focuses on the implications of these changes for survey errors. We argue that the concepts used to measure and assess survey errors still reflect the perspective of the 'traditional' survey with primary data collection. With the current changes in the data collection processes these concepts are no longer fully appropriate. For example, an assessment of coverage errors has to take into account the differences in the construction of the target population in registers and traditional censuses. The changes in the response process necessitate changes in the way we measure and correct for nonresponse errors (e.g. detection of erroneous or missing values in registers). Measurement errors have to be conceived differently in order to cover, e.g., errors due to lack of compliance of statistical concepts with the concepts used in public administration. We propose an 'error portfolio' enabling an assessment of the impacts of changes in data collection processes on the survey error.
JEL classification: C80

20.
Abstract.  Let X be a d-variate random vector that is completely observed, and let Y be a random variable that is subject to right censoring and left truncation. For arbitrary functions φ we consider expectations of the form E[φ(X, Y)], which appear in many statistical problems, and we estimate these expectations by using a product-limit estimator for censored and truncated data, extended to the context where covariates are present. An almost sure representation for these estimators is obtained, with a remainder term that is of a certain negligible order, uniformly over a class of φ-functions. This uniformity is important for the application to goodness-of-fit testing in regression and to inference for the regression depth, which we consider in more detail.
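A sketch of the estimator class this abstract studies, restricted to the covariate-free case and assuming the lifelines package; the data are simulated, and with heavy censoring the estimated distribution may leave some mass unassigned:

```python
# Product-limit (Kaplan-Meier) fit for right-censored, left-truncated Y,
# then E[phi(Y)] estimated by summing phi over the jumps of the estimated
# distribution.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(3)
n = 1000
y = rng.exponential(2.0, n)            # latent lifetimes
c = rng.exponential(4.0, n)            # right-censoring times
entry = rng.uniform(0.0, 0.5, n)       # left-truncation (entry) times
keep = y > entry                       # truncated units are never observed
t, d = np.minimum(y, c)[keep], (y <= c)[keep]

kmf = KaplanMeierFitter().fit(t, event_observed=d, entry=entry[keep])

surv = kmf.survival_function_["KM_estimate"]
jumps = -surv.diff().fillna(1.0 - surv.iloc[0])   # mass of the estimated CDF

def phi(v):
    return v ** 2

print(float((phi(surv.index.to_numpy()) * jumps.to_numpy()).sum()))
```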
