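The Statistical Connotation of Data Science
Data science takes big data as its object of study. The most direct impact of big data on statistical analysis is the change in how data are collected; at the same time, the scope of statistical analysis is no longer confined to traditional attribute data but extends to richer data types such as relational, unstructured, and semi-structured data. With the open data movement, the value of linkage information between databases is gradually being recognized. From a statistical perspective, the paper examines the statistical connotation of data science along three dimensions (scientific theoretical foundations, computer processing technology, and commercial applications), and discusses the direct influence of the data science paradigm on the process of statistical analysis as well as the opportunities and challenges facing the statistical perspective.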

Because of ethical and practical difficulties, controlled experimentation is seldom possible in the field of industrial medicine. Safety standards for industrial hazards must therefore be based upon uncontrolled observational data. This paper is concerned with the effects of population selection and of the association between exposure and response upon the exposure-response relationships derived from uncontrolled studies. It is shown that serious distortions may result and that ‘safe limits’ derived from uncontrolled studies may underestimate the real hazard.

Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable, and summarizes some early initiatives in formalizing the right of access to and re-use of scientific data.


Data Science is one of the newest interdisciplinary areas. It is transforming our lives unexpectedly fast, and the same transformation is reaching the ways we learn and practice. We advocate an approach to data science training that uses several types of computational tools, including R, bash, awk, regular expressions, SQL, and XPath, often in tandem. We discuss ways for undergraduate mentees to learn about data science topics at an early point in their training. We give researchers, professors, and practitioners some intuition about how to embed real-life examples effectively into data science learning environments. The result is a unified program built on a foundation of team-oriented, data-driven projects.

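With the arrival of the information society, accelerating change in the market environment, and the shock of uncertainty, business operators face market saturation, faster product obsolescence, fickle consumer tastes, and the emergence of foreign competitors. Managers' intuition and subjective judgment alone can no longer meet the needs of decision making, and turning passive adaptation into active assessment of trends has become an important issue. As the market moves toward professional specialization, data analysis science (DataScientific) has gradually become an indispensable tool for managerial decision makers, and business owners are increasingly willing to accept data analysis science as an explicit cost. But what is data analysis science? The following gives an introduction. Purpose of data analysis science: the purpose of data analysis science is to…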

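Research on the Development of Data Science and Talent Cultivation
The paper traces four stages in the formation and development of data science as a discipline (formal birth, evolution of meaning, professional development, and widespread application), outlines the demand for data science talent and the current state of talent cultivation in China and abroad, and proposes specific cultivation strategies: clarify the concept and specify the knowledge structure of professionals; compile and translate textbooks for core courses and build a data science course cluster; cultivate multiple types of professionals through collaboration among universities, government, and enterprises; and actively develop software and hardware to create high-quality platforms for big data practice and training.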

In biomedical research, weighted logrank tests are frequently applied to compare two samples of randomly right-censored survival times. We address the question of how to combine a number of weighted logrank statistics to achieve good power of the corresponding survival test for a whole linear space or cone of alternatives, which are given by hazard rates. This leads to a new class of semiparametric projection tests that are motivated by likelihood ratio tests for an asymptotic model. We show that these tests can be carried out as permutation tests and discuss their asymptotic properties. A simulation study together with the analysis of a classical data set illustrates the advantages.

By assuming that the underlying distribution belongs to the domain of attraction of an extreme value distribution, one can extrapolate the data to a far tail region so that a rare event can be predicted. However, when the distribution is in the domain of attraction of a Gumbel distribution, the extrapolation is generally quite limited in comparison with a heavy-tailed distribution. In view of this drawback, Weibull-tailed distributions have been studied recently. Some methods for choosing the sample fraction in estimating the Weibull tail coefficient and some bias reduction estimators have been proposed in the literature. In this paper, we show that the theoretical optimal sample fraction does not exist and that a bias reduction estimator does not always produce a smaller mean squared error than a biased estimator. Both findings differ from the heavy-tailed case. Further, we propose a refined class of Weibull-tailed distributions which are more useful in estimating high quantiles and extreme tail probabilities.

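Data quality directly affects the efficiency of data analysis and the reliability of its results. Data quality comprises the quality of the data structure and, given a data structure, the authenticity, consistency, and completeness of the data. Focusing on the stage after the data have been obtained, the paper considers how to identify potential quality problems from three perspectives (cells, records, and variables) and, supported by case studies, describes the various problems that may arise.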
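
The three perspectives named above (cells, records, variables) can be illustrated with a minimal sketch. It is not the paper's procedure: the function names, the choice of checks (missing cells, exact duplicate records, out-of-range values), and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]  # one row, keyed by column name (hypothetical layout)


def missing_cells(records: Sequence[Record], columns: Sequence[str]) -> List[Tuple[int, str]]:
    """Cell level: (row index, column) pairs whose value is None or an empty string."""
    return [
        (i, c)
        for i, row in enumerate(records)
        for c in columns
        if row.get(c) in (None, "")
    ]


def duplicate_records(records: Sequence[Record], columns: Sequence[str]) -> List[int]:
    """Record level: indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(records):
        key = tuple(row.get(c) for c in columns)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def out_of_range(records: Sequence[Record], column: str, low: float, high: float) -> List[int]:
    """Variable level: indices of rows whose numeric value lies outside [low, high]."""
    return [
        i
        for i, row in enumerate(records)
        if isinstance(row.get(column), (int, float)) and not (low <= row[column] <= high)
    ]


if __name__ == "__main__":
    # Hypothetical sample data, invented for this sketch.
    rows = [
        {"id": 1, "age": 34},
        {"id": 2, "age": None},
        {"id": 1, "age": 34},
        {"id": 3, "age": 240},
    ]
    print(missing_cells(rows, ["id", "age"]))
    print(duplicate_records(rows, ["id", "age"]))
    print(out_of_range(rows, "age", 0, 120))
```

Keeping one function per perspective makes it easy to report which level a problem belongs to before deciding how to repair it.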

This article assesses the potential magnitude of the loss of estimation efficiency caused by the adoption of a differenced model when the disturbances of the original (levels) linear regression model follow either a stable (autoregressive) AR(1) process or a fixed start-up random-walk process (hence no filtering is necessary from the standpoint of estimation). The magnitude of the loss, which can be quite large, is found to be affected by both the form of the original model (homogeneous or nonhomogeneous) and the sign and magnitude of the autocorrelation coefficient of the AR(1) disturbance, as well as by the nature of the exogenous variable (smoothly trended or not).
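
As a worked sketch of why differencing can matter in the stable AR(1) case, the levels model and its first difference can be written as follows. The notation ($y_t$, $x_t$, $\beta$, $u_t$, $\rho$, $\varepsilon_t$, lag operator $L$) is assumed here for illustration and is not taken from the article.

```latex
\begin{align*}
y_t &= x_t'\beta + u_t, \qquad u_t = \rho\,u_{t-1} + \varepsilon_t, \quad |\rho| < 1,\\
\Delta y_t &= \Delta x_t'\beta + \Delta u_t, \qquad
\Delta u_t = \frac{1 - L}{1 - \rho L}\,\varepsilon_t = (\rho - 1)\,u_{t-1} + \varepsilon_t .
\end{align*}
```

Under these assumptions the differenced disturbance is an ARMA(1,1) process with a unit moving-average root, so least squares on the differenced model ignores serial correlation that differencing itself introduced. Only in the limiting case $\rho = 1$ does $\Delta u_t$ reduce to the white noise $\varepsilon_t$.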

We discuss the statistical properties of return-based OLS style analysis introduced by Sharpe (1992). The aim of style analysis is to infer a fund manager's investment decisions using only publicly available data on the fund performance and on the time evolution of market indexes. We show that the model proposed by Sharpe suffers from significant drawbacks, most notably that it fails to yield correct results even in the simple case of a buy-and-hold strategy that only invests in the market indexes. Under this hypothesis we show that a model linear in index levels, as opposed to index returns, estimated via a Kalman filter avoids the drawbacks of Sharpe's model. We further extend our analysis to strategies where the fund manager's policy changes with time and the asset classes in which the fund manager invests are not known exactly. In this last case we show that a style analysis is possible only conditional on an orthogonality hypothesis on the active investment strategy or with the introduction of suitable instrumental variables. The authors are grateful to the editor and an anonymous referee for many comments which greatly helped in improving the paper. The authors are, obviously, fully responsible for any remaining error.

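Research on Methods for Converting R&D Data of Research and Development Institutions in Science and Technology Statistics
Yang Hongjin (杨宏进). 《统计研究》 (Statistical Research), 2002, 19(2): 30-32
1. Introduction. Science and technology, and research and development (R&D) in particular, are intrinsic drivers of economic development and social progress. To formulate and evaluate national policy scientifically, especially science and technology policy, information and data on scientific and technological activities must be surveyed and systematically collected. To recommend standards and norms for surveying R&D activities, the OECD (Organisation for Economic Co-operation and Development) published its manual for surveys of research and development (《研究与发展调查手册》). Since the science and technology census of 1985, China has followed international standards and norms, adapted them to national conditions, and, after more than a decade of sustained effort, established its own norms and system for science and technology statistics. Under the OECD institutional classification, China's R&D resources are mainly distributed among government-affiliated research and development institutions, enterprises, and the universities and colleges of the higher education sector.…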

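Yu Liping (俞立平). 《统计研究》 (Statistical Research), 2009, 26(7): 103-108
In the evaluation of science, technology, and education, different evaluation methods can yield inconsistent results even when they are applied to the same objects, with the same indicators and the same data. To address this problem, and taking the world university rankings of the Times Higher Education Supplement (《泰晤士报高等教育副刊》) as an example, this paper proposes a new combined evaluation method, the common-data ordered choice model. It works as follows: first, the objects are evaluated with every feasible evaluation method; next, the results are ranked and grouped into grades, and the objects on which all methods agree are selected; then an ordered multinomial choice model is estimated by regression to obtain a coefficient for each indicator, and the standardized coefficients serve as the weights of the combined evaluation; finally, a weighted aggregation gives the evaluation result. The common-data ordered choice model overcomes the shortcomings of other combination methods, which rely on majority rule and place restrictions on the evaluation methods that can be used.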

A new statistic and a new method of analysis are proposed for data where a sample of respondents provides a preference ordering of some treatments. The new preference statistic is compared with the Friedman statistic, particularly for an example where 12 home owners each ranked four grasses. The new analysis provides a more natural and less misleading assessment of where the differences occur than an analysis based on the rank sums of the Friedman statistic. The new analysis is also more robust to deviations from the classical location problem, is not related to election methods known to have undesirable characteristics and adheres to the Condorcet criterion for election methods.
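
For orientation, the Friedman statistic used as the baseline above can be computed from rank sums, and the pairwise preference counts that a Condorcet-style comparison relies on can be tabulated from the same rankings. The sketch below is illustrative only: it does not implement the paper's new statistic, it assumes complete rankings without ties, and the sample rankings are invented (the grass data are not given in the abstract).

```python
from typing import List, Sequence


def friedman_statistic(ranks: Sequence[Sequence[int]]) -> float:
    """Friedman statistic for n respondents each ranking k treatments (no ties).

    Q = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1), where R_j is the
    rank sum of treatment j over all respondents.
    """
    n = len(ranks)
    k = len(ranks[0])
    rank_sums = [sum(row[j] for row in ranks) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def pairwise_preferences(ranks: Sequence[Sequence[int]]) -> List[List[int]]:
    """wins[a][b] = number of respondents who ranked treatment a ahead of b."""
    k = len(ranks[0])
    wins = [[0] * k for _ in range(k)]
    for row in ranks:
        for a in range(k):
            for b in range(k):
                if row[a] < row[b]:  # a smaller rank means more preferred
                    wins[a][b] += 1
    return wins


if __name__ == "__main__":
    # Hypothetical rankings: 4 respondents, 4 treatments, rank 1 = most preferred.
    sample = [
        [1, 2, 3, 4],
        [2, 1, 3, 4],
        [1, 3, 2, 4],
        [1, 2, 4, 3],
    ]
    print(friedman_statistic(sample))
    print(pairwise_preferences(sample))
```

The rank sums compress each respondent's ordering into one number per treatment, whereas the pairwise table keeps who was preferred to whom, which is the information a Condorcet-type criterion is stated in terms of.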

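Developments in Spatial Data Analysis
Spatial statistics is a discipline that studies spatial problems and a rapidly developing branch of applied mathematics. Although traditional data analysis offers many good methods, they cannot be applied wholesale to spatial data. The estimation of spatial models depends not only on the assumptions of the various regression forms but also on the properties of spatial correlation and spatial heterogeneity. The paper examines the development and future trends of spatial data analysis methods in terms of spatial models and inference, adaptive estimation, nonparametric regression, and tests for the absence of spatial correlation.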

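The article briefly introduces exploratory data analysis, an efficient and robust technique for knowledge discovery, and applies it to the relationship between the 60-day average price of stocks on the Chinese stock market and their tradable share capital at three points in time. A quantitative relationship between the two is obtained, demonstrating the effectiveness of exploratory data analysis for stock market data.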

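Association Rules in Data Mining
Data mining has been a popular topic in the business world in recent years. It uses statistical and artificial intelligence algorithms to uncover hidden patterns in large volumes of historical business data and to build accurate models for predicting the future. Mining association rules is one of the important problems in data mining.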
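
As a minimal sketch of what an association rule measures, the two standard quantities, support and confidence, can be computed directly from a list of transactions. The function names and the shopping-basket data are assumptions made for this illustration; the abstract itself does not specify an algorithm.

```python
from typing import Iterable, Sequence, Set


def support(transactions: Sequence[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: Sequence[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, both) / lhs_support


if __name__ == "__main__":
    # Hypothetical transactions, invented for this sketch.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule is usually reported only when both values exceed thresholds chosen by the analyst; scanning every candidate itemset this way is too slow for large databases, which is the problem that dedicated mining algorithms address.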

