Similar Documents
20 similar documents found.
1.
A Review of the Theory and Methods of Statistical Data Preprocessing
Statistical data preprocessing is an important stage for improving data quality and comprises four major steps: data review, data cleaning, data transformation, and data validation. According to the characteristics of the data being processed and the goals of each step, six broad classes of methods are available: descriptive and exploratory analysis, missing-value treatment, outlier treatment, data-transformation techniques, reliability and validity testing, and macro-level data diagnostics. Choosing appropriate preprocessing methods helps ensure that the conclusions of subsequent data analysis are valid and reliable.
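A minimal sketch of several of the steps named above (descriptive review, missing-value handling, outlier screening, and transformation), using hypothetical survey-style data; the variable names, imputation choice, and thresholds are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)

# Hypothetical survey-style data with some missing incomes.
df = pd.DataFrame({"income": rng.lognormal(10, 0.5, 200)})
df.loc[rng.choice(200, 10, replace=False), "income"] = np.nan

# Step 1: data review -- descriptive/exploratory summary.
summary = df["income"].describe()

# Step 2: missing-value treatment -- median imputation as one simple option.
df["income"] = df["income"].fillna(df["income"].median())

# Step 3: outlier screening -- the common 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Step 4: data transformation -- log transform to reduce skewness.
df["log_income"] = np.log(df["income"])
```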

2.
3.
Missing data are an important factor affecting the quality of survey questionnaire data, and imputing the missing values in a questionnaire can significantly improve data quality. Questionnaire data are predominantly categorical, and classification algorithms from data mining are a common way to handle categorical attributes; among them, the random forest model is one of the more accurate classifiers. This paper introduces the random forest model into the imputation of missing questionnaire data, proposes a random-forest-based method for imputing missing categorical values, and discusses the corresponding imputation steps for different missingness patterns. Empirical simulations comparing the method with alternatives show that the random forest imputations are more accurate and more reliable.
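The core idea can be sketched as follows: train a random forest classifier on the complete cases and predict the missing categorical responses. The data, missingness pattern, and hyperparameters below are invented for illustration and do not reproduce the paper's procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical categorical survey data: three fully observed items (X) and
# one item (y) with missing responses, all coded as small integers.
n = 500
X = rng.integers(0, 3, size=(n, 3))
y = (X[:, 0] + X[:, 1]) % 3          # target depends on the other items
miss = rng.random(n) < 0.2           # ~20% missing completely at random

# Fit a random forest on the complete cases, then impute the missing ones.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[~miss], y[~miss])
y_imputed = y.copy()
y_imputed[miss] = clf.predict(X[miss])

accuracy = (y_imputed[miss] == y[miss]).mean()
```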

4.
As the foundation of e-commerce development, e-commerce data management supports the normal and effective operation of e-commerce for economic units. It is therefore necessary to build an e-commerce data platform; to manage e-commerce data, including the collection and management of logistics data, the analysis and management of capital-flow data, and the acquisition and management of information-flow data; and to clarify the social and enterprise conditions for optimal e-commerce data management. Managers can then make full use of the efficient, convenient data advantages offered by the e-commerce platform to manage their enterprises better.

5.
This article presents a Bayesian analysis of the von Mises–Fisher distribution, which is the most important distribution in the analysis of directional data. We obtain samples from the posterior distribution using a sampling-importance-resampling method. The procedure is illustrated using simulated data as well as real data sets previously analyzed in the literature.
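A sketch of sampling-importance-resampling (SIR) for directional data, using the circular (2-D) von Mises special case with the concentration parameter assumed known and a uniform prior on the mean direction; this is an illustration of the general SIR idea, not the article's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated directional data from a von Mises distribution (the circular
# special case of the von Mises-Fisher family); kappa is assumed known.
kappa, mu_true = 4.0, 1.0
theta = rng.vonmises(mu_true, kappa, size=200)

# Sampling-importance-resampling for the mean direction mu:
# 1. draw candidates from a uniform proposal on (-pi, pi],
# 2. weight each candidate by the (unnormalised) posterior,
# 3. resample candidates proportionally to the weights.
candidates = rng.uniform(-np.pi, np.pi, size=20000)
log_w = kappa * np.cos(theta[:, None] - candidates[None, :]).sum(axis=0)
w = np.exp(log_w - log_w.max())
posterior = rng.choice(candidates, size=2000, p=w / w.sum())

# Circular posterior mean of the resampled draws.
post_mean = np.arctan2(np.sin(posterior).mean(), np.cos(posterior).mean())
```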

6.
The paper gives a review of a number of data models for aggregate statistical data which have appeared in the computer science literature in the last ten years. After a brief introduction to the data model in general, the fundamental concepts of statistical data are introduced. These are called statistical objects because they are complex data structures (vectors, matrices, relations, time series, etc.) which may have different possible representations (e.g. tables, relations, vectors, pie-charts, bar-charts, graphs, and so on). For this reason a statistical object is defined by two different types of attribute: a summary attribute, with its own summary type and its own instances, called summary data, and a set of category attributes, which describe the summary attribute. Some conceptual models of statistical data (CSM, SDM4S), some semantic models (SCM, SAM*, OSAM*), and some graphical models (SUBJECT, GRASS, STORM) are also discussed.

7.
Inequality-restricted hypotheses testing methods containing multivariate one-sided testing methods are useful in practice, especially in multiple comparison problems. In practice, multivariate and longitudinal data often contain missing values since it may be difficult to observe all values for each variable. However, although missing values are common for multivariate data, statistical methods for multivariate one-sided tests with missing values are quite limited. In this article, motivated by a dataset in a recent collaborative project, we develop two likelihood-based methods for multivariate one-sided tests with missing values, where the missing data patterns can be arbitrary and the missing data mechanisms may be non-ignorable. Although non-ignorable missing data are not testable based on observed data, statistical methods addressing this issue can be used for sensitivity analysis and might lead to more reliable results, since ignoring informative missingness may lead to biased analysis. We analyse the real dataset in detail under various possible missing data mechanisms and report interesting findings which were previously unavailable. We also derive some asymptotic results and evaluate our new tests using simulations.

8.
Characteristics of Bidders' Bid Data in Online Auctions and Methods for Their Analysis
In traditional statistical analysis, researchers face numerical data in three forms: cross-sectional data, time-series data, and pooled data. These types of data are discrete, equally spaced, and of uniform density, and they are the main objects of traditional descriptive and inferential statistics. Data collected from auction websites, such as bidders' bids, do not share these characteristics and thus challenge traditional statistical methods. It is therefore necessary to explain the generating mechanism of online auction data and analyse its characteristics in terms of data volume, mixture of data types, unequal spacing, and data density; this paper does so and, using real online auction records, presents methods and procedures for analysing such data.

9.
An Inquiry into the Quality Costs of Government Statistical Data
傅德印, 陶然. 《统计研究》 (Statistical Research), 2007, 24(8): 9-12
Building on a definition of the quality cost of statistical data, this paper analyses its components, presents a table of quality-cost elements together with an accounting method, outlines the analysis, forecasting, planning, and control of statistical data quality costs, and discusses the relationship between the quality cost of statistical data and statistical data quality itself.

10.
An imputation procedure is a procedure by which each missing value in a data set is replaced (imputed) by an observed value using a predetermined resampling procedure. The distribution of a statistic computed from a data set consisting of observed and imputed values, called a completed data set, is affected by the imputation procedure used. In a Monte Carlo experiment, three imputation procedures are compared with respect to the empirical behavior of the goodness-of-fit chi-square statistic computed from a completed data set. The results show that each imputation procedure affects the distribution of the goodness-of-fit chi-square statistic in a different manner. However, when the empirical behavior of the goodness-of-fit chi-square statistic is compared to its appropriate asymptotic distribution, there are no substantial differences between these imputation procedures.
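The shape of such a Monte Carlo experiment can be sketched as below. The paper's three procedures are not specified in the abstract, so the sketch uses a single simple procedure (random hot-deck resampling of observed values) and computes the goodness-of-fit chi-square statistic on each completed data set; all parameter values are illustrative.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(2)

# Categorical data with known cell probabilities; 10% of values go missing
# and are imputed by hot-deck (resampling observed values), then a
# goodness-of-fit chi-square is computed on the completed data.
probs = np.array([0.5, 0.3, 0.2])
n, n_rep = 300, 500
stats = []
for _ in range(n_rep):
    x = rng.choice(3, size=n, p=probs)
    miss = rng.random(n) < 0.1
    x[miss] = rng.choice(x[~miss], size=miss.sum())   # hot-deck imputation
    observed = np.bincount(x, minlength=3)
    stats.append(chisquare(observed, n * probs).statistic)

# The empirical mean can be compared with the chi-square(2) mean of 2.
mean_stat = np.mean(stats)
```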

11.
The big data currently available are not complete population data: they typically fail to cover the whole population, and their multi-source, heterogeneous nature hinders traditional methods of data analysis. This paper introduces survey sampling methods into big data and analyses the necessity of applying multiple sampling frames in a big data setting. Addressing the difficulty of multi-source heterogeneous data, it treats each data source as a sampling frame and proposes the construction of multiple sampling frames for big data. It then classifies the data according to their characteristics, determines whether a multistage sampling design is needed in each situation, and proposes using the SF estimator to estimate the population from the multiple frames; this estimator suits the needs of multiple-frame estimation for big data and provides good estimates of the population.

12.
The currently existing estimation methods and goodness-of-fit tests for the Cox model mainly deal with right censored data, but they do not have direct extension to other complicated types of censored data, such as doubly censored data, interval censored data, partly interval-censored data, bivariate right censored data, etc. In this article, we apply the empirical likelihood approach to the Cox model with complete data, derive the semiparametric maximum likelihood estimators (SPMLE) for the Cox regression parameter and the baseline distribution function, and establish the asymptotic consistency of the SPMLE. Via the functional plug-in method, these results are extended in a unified approach to doubly censored data, partly interval-censored data, and bivariate data under univariate or bivariate right censoring. For all of these types of censored data, the estimation procedures developed here naturally lead to Kolmogorov-Smirnov goodness-of-fit tests for the Cox model. Some simulation results are presented.

13.
Most data used to study the durations of unemployment spells come from the Current Population Survey (CPS), which is a point-in-time survey and gives an incomplete picture of the underlying duration distribution. We introduce a new sample of completed unemployment spells obtained from panel data and apply CPS sampling and reporting techniques to replicate the type of data used by other researchers. Predicted duration distributions derived from this CPS-like data are then compared to the actual distribution. We conclude that the best inferences that can be made about unemployment durations by using CPS-like data are seriously biased.

14.
The paper presents a new approach to interrelated two-way clustering of gene expression data. Clustering of genes has been effected using entropy and a correlation measure, whereas the samples have been clustered using the fuzzy C-means. The efficiency of this approach has been tested on two well known data sets: the colon cancer data set and the leukemia data set. Using this approach, we were able to identify the important co-regulated genes and group the samples efficiently at the same time.

15.
Summary.  Statistical agencies that own different databases on overlapping subjects can benefit greatly from combining their data. These benefits are passed on to secondary data analysts when the combined data are disseminated to the public. Sometimes combining data across agencies or sharing these data with the public is not possible: one or both of these actions may break promises of confidentiality that have been given to data subjects. We describe an approach that is based on two stages of multiple imputation that facilitates data sharing and dissemination under restrictions of confidentiality. We present new inferential methods that properly account for the uncertainty that is caused by the two stages of imputation. We illustrate the approach by using artificial and genuine data.
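For background, a sketch of the standard single-stage multiple-imputation combining rules (Rubin's rules), which the article's two-stage inferential methods generalise; the simple hot-deck imputation model and all numbers below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: a normal sample with 20 values missing at random.
x = rng.normal(50, 10, 100)
x[rng.choice(100, 20, replace=False)] = np.nan
obs = x[~np.isnan(x)]

# m completed data sets via a simple hot-deck imputation model.
m = 5
estimates, variances = [], []
for _ in range(m):
    filled = x.copy()
    filled[np.isnan(filled)] = rng.choice(obs, np.isnan(x).sum())
    estimates.append(filled.mean())
    variances.append(filled.var(ddof=1) / filled.size)

# Rubin's combining rules for the mean.
q_bar = np.mean(estimates)                    # combined point estimate
b = np.var(estimates, ddof=1)                 # between-imputation variance
t_var = np.mean(variances) + (1 + 1 / m) * b  # total variance
```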

16.
Scanner data offer a new technical paradigm for the informatisation of source data in government statistics and for macroeconomic measurement. Based on a review of how countries around the world use scanner data to compile their CPI, and considering the current state of scanner data in China and the characteristics of Chinese government price statistics, this paper proposes an approach to compiling the Chinese CPI from scanner data, aiming to provide theoretical and practical guidance for the "big data"-driven reform of source data in government statistics.
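As one common building block in CPI compilation from scanner data, an elementary price index can be computed per product group; the sketch below computes a Jevons (geometric-mean) index from a few invented scanner records and is not the compilation method proposed in the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical scanner records: item, month, average transaction price.
scanner = pd.DataFrame({
    "item":  ["rice", "rice", "milk", "milk", "eggs", "eggs"],
    "month": [1, 2, 1, 2, 1, 2],
    "price": [4.0, 4.2, 6.0, 6.3, 10.0, 9.5],
})

# Match items across the two months and form price relatives.
wide = scanner.pivot(index="item", columns="month", values="price")
relatives = wide[2] / wide[1]

# Jevons index: the unweighted geometric mean of the price relatives.
jevons = float(np.exp(np.log(relatives).mean()))
```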

17.
Recurrent events involve the occurrences of the same type of event repeatedly over time and are commonly encountered in longitudinal studies. Examples include seizures in epileptic studies or occurrence of cancer tumors. In such studies, interest lies in the number of events that occur over a fixed period of time. One considerable challenge in analyzing such data arises when a large proportion of patients discontinue before the end of the study, for example, because of adverse events, leading to partially observed data. In this situation, data are often modeled using a negative binomial distribution with time‐in‐study as offset. Such an analysis assumes that data are missing at random (MAR). As we cannot test the adequacy of MAR, sensitivity analyses that assess the robustness of conclusions across a range of different assumptions need to be performed. Sophisticated sensitivity analyses for continuous data are frequently performed. However, this is less the case for recurrent event or count data. We present a flexible approach to perform clinically interpretable sensitivity analyses for recurrent event data. Our approach fits into the framework of reference‐based imputations, where information from reference arms can be borrowed to impute post‐discontinuation data. Different assumptions about the future behavior of dropouts dependent on reasons for dropout and received treatment can be made. The imputation model is based on a flexible model that allows for time‐varying baseline intensities. We assess the performance in a simulation study and provide an illustration with a clinical trial in patients who suffer from bladder cancer. Copyright © 2015 John Wiley & Sons, Ltd.

18.
Statistical process control of multi-attribute count data has received much attention with modern data-acquisition equipment and online computers. The multivariate Poisson distribution is often used to monitor multivariate attribute count data. However, little work has been done so far on under- or over-dispersed multivariate count data, which are common in many industrial processes, with positive or negative correlation. In this study, a Shewhart-type multivariate control chart is constructed to monitor this kind of data, namely the multivariate COM-Poisson (MCP) chart, based on the MCP distribution. The performance of the MCP chart is evaluated by the average run length in simulation. The proposed chart generalizes some existing multivariate attribute charts as its special cases. A real-life bivariate process and a simulated trivariate Poisson process are used to illustrate the application of the MCP chart.
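A simplified run-length simulation for a Shewhart-type chart on bivariate count data. For tractability this sketch uses independent Poisson counts (a special case of the MCP family) with per-variable 3-sigma limits, not the full MCP chart; the rates and limits are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# In-control bivariate Poisson rates and 3-sigma upper control limits.
lam = np.array([4.0, 6.0])
ucl = lam + 3 * np.sqrt(lam)

def run_length(rng):
    """Number of samples until the chart signals (either count > its UCL)."""
    t = 0
    while True:
        t += 1
        x = rng.poisson(lam)
        if (x > ucl).any():
            return t

# In-control average run length (ARL) estimated over repeated runs.
arl = np.mean([run_length(rng) for _ in range(2000)])
```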

19.
For large volumes of complex test-range observation data, this paper constructs initial fitting data and builds a recursive model from B-spline curves. Whether the data flagged by a two-direction (forward and backward) test are anomalous is judged against a decision threshold estimated by spline smoothing, and data that satisfy the repair condition are repaired by fitting; when the two directions of the test disagree, an interpolation model is constructed for further testing. A case study shows that, compared with other methods, the proposed method removes anomalous data more effectively, and that piecewise processing makes it better at testing data with possible step changes, giving the model better stability, wider applicability, and a higher anomaly-rejection rate.
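The core idea of flagging observations against a spline smooth can be sketched as below; this is a simplified stand-in that omits the paper's recursive two-direction test and repair step, and the smoothing budget, outlier sizes, and threshold rule are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import splrep, splev

rng = np.random.default_rng(5)

# Hypothetical trajectory measurements with a few injected gross outliers.
t = np.linspace(0, 10, 200)
y = np.sin(t) + rng.normal(0, 0.05, t.size)
outlier_idx = np.array([30, 90, 150])
y[outlier_idx] += 2.5

# Smoothing B-spline fit; s is a hand-tuned residual budget chosen so the
# spline follows the trend without chasing the outliers.
tck = splrep(t, y, s=20.0)
resid = y - splev(t, tck)

# Flag points whose residual exceeds a robust (MAD-based) threshold.
sigma_hat = np.median(np.abs(resid)) / 0.6745
flagged = np.flatnonzero(np.abs(resid) > 4 * sigma_hat)
```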

20.
This paper proposes a "Zadeh"-style form of fuzzy data and studies the fuzzy sample mean of such data and its statistical testing. It gives definitions of fuzzy equality and fuzzy membership, proposes tests for the means of discrete and continuous fuzzy populations, and illustrates the application of these statistical methods with examples.
