Found 20 similar articles (search time: 15 ms)
1.
Sean Kross, Roger D. Peng, Brian S. Caffo, Ira Gooding, Jeffrey T. Leek 《The American Statistician》2020,74(1):1-7
Over the last three decades, data have become ubiquitous and cheap. This transition has accelerated over the last five years and training in statistics, machine learning, and data analysis has struggled to keep up. In April 2014, we launched a program of nine courses, the Johns Hopkins Data Science Specialization, which has now had more than 4 million enrollments over the past five years. Here, the program is described and compared to standard data science curricula as they were organized in 2014 and 2015. We show that novel pedagogical and administrative decisions introduced in our program are now standard in online data science programs. The impact of the Data Science Specialization on data science education in the U.S. is also discussed. Finally, we conclude with some thoughts about the future of data science education in a data democratized world.
2.
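The Statistical Connotations of Data Science  Total citations: 1, self-citations: 0, citations by others: 1
Data science takes big data as its object of study. The most direct impact of big data on statistical analysis is the change in how data are collected; at the same time, the scope of statistical analysis is no longer confined to traditional attribute data but extends to richer data types such as relational, unstructured, and semi-structured data. With the open data movement, the value of the linkage information between databases is gradually being realized. From a statistical perspective, the paper examines the statistical connotations of data science along three dimensions (scientific theoretical foundations, computer processing technology, and commercial applications) and discusses the direct influence of the data science paradigm on the statistical analysis process, as well as the opportunities and challenges facing the statistical perspective.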
3.
Because of ethical and practical difficulties, controlled experimentation is seldom possible in the field of industrial medicine. Safety standards for industrial hazards must therefore be based upon uncontrolled observational data. This paper is concerned with the effects of population selection and of the association between exposure and response upon the exposure-response relationships derived from uncontrolled studies. It is shown that serious distortions may result and that ‘safe limits’ derived from uncontrolled studies may underestimate the real hazard.
4.
Data Science is one of the newest interdisciplinary areas. It is transforming our lives unexpectedly fast. This transformation is also happening in our learning styles and practicing habits. We advocate an approach to data science training that uses several types of computational tools, including R, bash, awk, regular expressions, SQL, and XPath, often used in tandem. We discuss ways for undergraduate mentees to learn about data science topics, at an early point in their training. We give some intuition for researchers, professors, and practitioners about how to effectively embed real-life examples into data science learning environments. As a result, we have a unified program built on a foundation of team-oriented, data-driven projects.
5.
Peter Murray-Rust 《Serials Review》2008,34(1):52-64
Open Data (OD) is an emerging term in the process of defining how scientific data may be published and re-used without price or permission barriers. Scientists generally see published data as belonging to the scientific community, but many publishers claim copyright over data and will not allow its re-use without permission. This is a major impediment to the progress of scholarship in the digital age. This article reviews the need for Open Data, shows examples of why Open Data are valuable, and summarizes some early initiatives in formalizing the right of access to and re-use of scientific data.
6.
7.
Research on the Development of Data Science and Talent Cultivation  Total citations: 2, self-citations: 1, citations by others: 2
《统计与信息论坛》2019,(1):117-122
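The paper traces the four stages in the formation and development of data science as a discipline (formal birth, evolution of meaning, professional development, and widespread application), outlines domestic and international demand for data science talent and the current state of talent training, and proposes specific training strategies: clarify the concepts and define the knowledge structure required of professionals; compile and translate textbooks for the core courses and build a data science course cluster; cultivate multiple types of professionals through collaboration among universities, government, and enterprises; and actively develop software and hardware to create high-quality big data practice and training platforms.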
8.
9.
Michael Brendel, Arnold Janssen, Claus‐Dieter Mayer, Markus Pauly 《Scandinavian Journal of Statistics》2014,41(3):742-761
In biomedical research, weighted logrank tests are frequently applied to compare two samples of randomly right censored survival times. We address the question of how to combine a number of weighted logrank statistics to achieve good power of the corresponding survival test for a whole linear space or cone of alternatives, which are given by hazard rates. This leads to a new class of semiparametric projection tests that are motivated by likelihood ratio tests for an asymptotic model. We show that these tests can be carried out as permutation tests and discuss their asymptotic properties. A simulation study together with the analysis of a classical data set illustrates the advantages.
10.
By assuming that the underlying distribution belongs to the domain of attraction of an extreme value distribution, one can extrapolate the data to a far tail region so that a rare event can be predicted. However, when the distribution is in the domain of attraction of a Gumbel distribution, the extrapolation is quite limited generally in comparison with a heavy tailed distribution. In view of this drawback, a Weibull tailed distribution has been studied recently. Some methods for choosing the sample fraction in estimating the Weibull tail coefficient and some bias reduction estimators have been proposed in the literature. In this paper, we show that the theoretical optimal sample fraction does not exist and a bias reduction estimator does not always produce a smaller mean squared error than a biased estimator. These are different from using a heavy tailed distribution. Further we propose a refined class of Weibull tailed distributions which are more useful in estimating high quantiles and extreme tail probabilities.
11.
This article assesses the potential magnitude of the loss of estimation efficiency caused by the adoption of a differenced model when the disturbances of the original (levels) linear regression model follow either a stable (autoregressive) AR(1) process or a fixed start-up random-walk process (hence no filtering is necessary from the standpoint of estimation). The magnitude of the loss, which can be quite large, is found to be affected by both the form of the original model (homogeneous or nonhomogeneous) and the sign and magnitude of the autocorrelation coefficient of the AR(1) disturbance, as well as by the nature of the exogenous variable (smoothly trended or not).
12.
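Data quality directly affects the efficiency of data analysis and the reliability of its results. Data quality comprises the quality of the data structure and, given that structure, the authenticity, consistency, and completeness of the data. Focusing on the stage after the data have been obtained, the paper considers how to identify potential quality problems from three perspectives (cell, record, and variable) and, supported by case studies, describes the various problems that may arise.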
13.
We discuss the statistical properties of return-based OLS style analysis introduced by Sharpe (1992). The aim of style analysis is to infer a fund manager's investment decisions using only publicly available data on the fund performance and on the time evolution of market indexes. We show that the model proposed by Sharpe suffers from significant drawbacks, most notably that it fails to yield correct results even in the simple case of a buy-and-hold strategy that only invests in the market indexes. Under this hypothesis we show that a model linear in index levels, as opposed to index returns, estimated via a Kalman filter avoids the drawbacks of Sharpe's model. We further extend our analysis to strategies where the fund manager's policy changes with time and the asset classes in which the fund manager invests are not known exactly. In this last case we show that a style analysis is possible only conditional either on an orthogonality hypothesis on the active investment strategy or on the introduction of suitable instrumental variables. The authors are grateful to the editor and an anonymous referee for many comments which greatly helped in improving the paper. The authors are, obviously, fully responsible for any remaining error.
14.
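I. Introduction. Science and technology, especially research and development (R&D), are the intrinsic driving force of economic development and social progress. To formulate and evaluate national policies scientifically, especially science and technology policies, information and data on science and technology activities must be surveyed and collected systematically. To recommend standards and norms for surveying R&D activities, the OECD (Organisation for Economic Co-operation and Development) published the Research and Development Survey Manual. Since the science and technology census of 1985, China has followed international standards and norms, adapted them to its national conditions, and after more than ten years of sustained effort has established its own norms and system for science and technology statistics. Under the OECD institutional classification, China's R&D resources are distributed mainly among government-affiliated research and development institutions, enterprises, and the colleges and universities of the higher education sector. …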
15.
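In science and technology education evaluation, different evaluation methods can yield inconsistent results even when they are applied to the same evaluation objects with the same indicators and the same data. To address this problem, the paper takes the Times Higher Education Supplement world university rankings as an example and proposes a new combined evaluation method, the common-data ordered choice model. The procedure is as follows: first, evaluate the objects with each feasible evaluation method; then rank and grade the evaluation results and select the objects on which all methods agree; next, fit an ordered multinomial choice model to obtain a regression coefficient for each indicator, and standardize these coefficients to serve as the weights of the combined evaluation; finally, compute the weighted sum to obtain the evaluation result. The common-data ordered choice model overcomes two shortcomings of other combination methods: their reliance on majority rule and their restrictions on which evaluation methods can be used.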
17.
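An Application of Exploratory Data Analysis: The Quantitative Relationship Between Stock Prices and Tradable Shares in the Shanghai and Shenzhen Stock Markets  Total citations: 1, self-citations: 0, citations by others: 1
The article briefly introduces exploratory data analysis, an efficient and robust knowledge discovery technique, and applies it to study the relationship between the 60-day average price of stocks in the Chinese stock market and their tradable shares at three points in time. It derives the quantitative relationship between the two and demonstrates the effectiveness of exploratory data analysis for analyzing stock market data.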
18.
19.
20.
Ross H. Taplin 《Journal of the Royal Statistical Society. Series C, Applied statistics》1997,46(4):493-512
A new statistic and a new method of analysis are proposed for data where a sample of respondents provides a preference ordering of some treatments. The new preference statistic is compared with the Friedman statistic, particularly for an example where 12 home owners each ranked four grasses. The new analysis provides a more natural and less misleading assessment of where the differences occur than an analysis based on the rank sums of the Friedman statistic. The new analysis is also more robust to deviations from the classical location problem, is not related to election methods known to have undesirable characteristics and adheres to the Condorcet criterion for election methods.
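For readers unfamiliar with the baseline mentioned in entry 20, the following is a minimal sketch of the classical Friedman rank-sum statistic, assuming untied ranks. It does not implement the new preference statistic proposed in the article, whose definition is not given in the abstract. The function name `friedman_statistic` and the sample rankings are hypothetical and chosen only for illustration; they are not the article's data.

```python
from typing import Sequence


def friedman_statistic(rankings: Sequence[Sequence[int]]) -> float:
    """Classical Friedman statistic for n respondents ranking k treatments.

    Each row is one respondent's ranks (1..k, no ties) for the k treatments.
    Q = 12 / (n k (k + 1)) * sum_j R_j^2 - 3 n (k + 1), with R_j the rank sum
    of treatment j.
    """
    n = len(rankings)       # number of respondents
    k = len(rankings[0])    # number of treatments
    expected = list(range(1, k + 1))
    for row in rankings:
        if sorted(row) != expected:
            raise ValueError("each row must be a permutation of 1..k")
    # Rank sum of each treatment across all respondents.
    rank_sums = [sum(row[j] for row in rankings) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


if __name__ == "__main__":
    # Hypothetical rankings: 4 respondents, 4 treatments (not the article's data).
    sample = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4], [1, 2, 4, 3]]
    print(friedman_statistic(sample))
```

Under the null hypothesis of no treatment differences, this statistic is usually compared against a chi-square distribution with k - 1 degrees of freedom; it reaches its maximum, n(k - 1), when every respondent gives the same ordering.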