Similar Documents
 Found 20 similar documents (search time: 906 ms)
1.
Data Quality Issues in Data Mining   (Cited by: 1; self-citations: 0, others: 1)
Data quality is an important factor in the effectiveness of data mining, yet it has not received sufficient attention. This paper considers the role of data quality in data mining and presents several principal criteria for evaluating it. Drawing on statistical and machine-learning theory, it analyzes methods for resolving data quality problems and argues that the starting point for improving data quality is controlling the quality of the data source.

2.
3.
As modern organizations gather, analyze, and share large quantities of data, issues of privacy and confidentiality are becoming increasingly important. Perturbation methods are used to protect confidentiality when confidential, numerical data are shared or disseminated for analysis. Unfortunately, existing perturbation methods are not suitable for protecting small data sets: with small data sets, they provide reduced protection against disclosure risk due to sampling error. Sampling error may also cause analyses of the perturbed data to yield different results than the original data, reducing data utility. In this study, we develop an enhancement of an existing perturbation technique, General Additive Data Perturbation, that can be used to effectively mask both large and small data sets. The proposed enhancement minimizes the risk of disclosure while ensuring that the results of commonly performed statistical analyses are identical for the original and the perturbed data.
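The paper's enhanced GADP procedure is not given in the abstract, but the core idea, perturbing data while leaving common statistical results unchanged, can be sketched: add masking noise, then linearly transform the noisy data so its mean vector and covariance matrix exactly match the original's. The function name and noise model below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def perturb_exact_moments(X, noise_scale=1.0, seed=0):
    """Additively perturb X, then linearly transform the result so the
    perturbed data reproduce X's mean vector and covariance matrix exactly.
    Analyses that depend only on first and second moments (means,
    covariances, OLS coefficients) then give identical results."""
    rng = np.random.default_rng(seed)
    Y = X + rng.normal(scale=noise_scale, size=X.shape)  # masking noise
    # Whiten Y, then re-colour with X's covariance and re-centre on X's mean.
    Ly = np.linalg.cholesky(np.cov(Y, rowvar=False))
    Lx = np.linalg.cholesky(np.cov(X, rowvar=False))
    return (Y - Y.mean(axis=0)) @ np.linalg.inv(Ly).T @ Lx.T + X.mean(axis=0)

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 3))          # a deliberately small data set
Z = perturb_exact_moments(X)
print(np.allclose(Z.mean(axis=0), X.mean(axis=0)))                  # True
print(np.allclose(np.cov(Z, rowvar=False), np.cov(X, rowvar=False)))  # True
```

Because the match is exact rather than only in expectation, the sampling-error problem the abstract describes for small data sets does not arise for moment-based analyses.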

4.
As an important strategic resource, big data raises a number of key management problems. This article first reviews how big data is understood from different perspectives. It then examines big data from a management viewpoint, arguing that big data is an important class of strategic information resource, and explores the management characteristics of big data resources along six dimensions: complexity, decision usefulness, rapid growth, value sparsity, repeatable exploitability, and functional diversity. Finally, it distills and discusses six key management problems concerning big data resources: acquisition, processing, application, property rights, industry, and regulation.

5.
This paper reexamines the scaling approaches used in cancer risk assessment and proposes a more precise body weight scaling factor. Two approaches are conventionally used in scaling exposure and dose from experimental animals to man: body weight scaling (used by FDA) and surface area scaling (BW^0.67, used by EPA). This paper reanalyzes the Freireich et al. (1966) study of the maximum tolerated dose (MTD) of 14 anticancer agents in mice, rats, dogs, monkeys, and humans, the dataset most commonly cited as justification for surface area extrapolation. This examination was augmented with an analysis of a similar dataset by Schein et al. (1970) of the MTD of 13 additional chemotherapy agents. The reanalysis shows that BW^0.75 is a more appropriate scaling factor for the 27 direct-acting compounds in this dataset.
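The competing scaling rules can be compared directly. A minimal sketch, assuming the standard allometric form (total dose proportional to BW^b); the helper name and example weights are hypothetical:

```python
def scale_dose(total_dose_animal_mg, bw_animal_kg, bw_human_kg=70.0, exponent=0.75):
    """Scale a total dose across species assuming total dose ~ BW**exponent."""
    return total_dose_animal_mg * (bw_human_kg / bw_animal_kg) ** exponent

# A 10 mg total dose tolerated by a hypothetical 0.25 kg rat, extrapolated
# to a 70 kg human under each scaling rule:
for b, label in [(1.0, "body weight"), (0.75, "BW^0.75"), (0.67, "surface area")]:
    print(f"{label:>12}: {scale_dose(10, 0.25, exponent=b):8.1f} mg")
```

The spread between the three extrapolated doses shows why the choice of exponent matters for risk assessment: body-weight scaling gives the largest human-equivalent dose, surface-area scaling the smallest, and BW^0.75 sits between them.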

6.
The objective of this article is to study the impact of weather on the damage caused by fire incidents across the United States. The article uses two sets of big data, fire incidents data from the National Fire Incident Reporting System (NFIRS) and weather data from the National Oceanic and Atmospheric Administration (NOAA), to obtain a single comprehensive data set for prediction and analysis of fire risk. In the article, the loss is referred to as "Total Percent Loss," a metric calculated as the content and property loss incurred by an owner over the total value of content and property. Gradient boosting tree (GBT), a machine learning algorithm, is implemented on the processed data to predict the losses due to fire incidents. An R² value of 0.933 and a mean squared error (MSE) of 124.641 out of 10,000 signify the high predictive accuracy obtained by the GBT model. The model's predictive performance is further validated by a strong fit between the predicted and actual loss on the test data set, with an R² value of 0.97. Analyzing the influence of each input variable on the output shows that the state in which a fire incident takes place plays a major role in determining fire risk. This article provides useful insights to fire managers and researchers in the form of a detailed framework of big data and predictive analytics for effective management of fire risk.
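The NFIRS/NOAA features are not available here, so the GBT workflow can only be sketched on synthetic stand-in data. This uses scikit-learn's GradientBoostingRegressor; every feature name, coefficient, and data value below is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.integers(0, 50, n),      # state (integer-encoded), invented
    rng.normal(60, 15, n),       # temperature, invented
    rng.normal(10, 5, n),        # wind speed, invented
])
# Synthetic "Total Percent Loss" driven mostly by state and temperature:
y = 0.5 * X[:, 0] + 0.3 * (70 - X[:, 1]) + rng.normal(0, 2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R^2 = {r2_score(y_te, pred):.3f}, MSE = {mean_squared_error(y_te, pred):.3f}")
print("feature importances:", model.feature_importances_.round(3))
```

The `feature_importances_` attribute is how one would check the article's finding that the state variable dominates the prediction of fire risk.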

8.
陈松蹊, 毛晓军, 王聪. 《管理世界》 (Management World), 2022, 38(1): 196-206
With the arrival of the digital economy, data has become an important factor of production that is profoundly reshaping the paradigm of management decision making. Analyzing and exploiting big data characterized by extreme scale, cross-domain coverage, and streaming information has become a key enabler of management practice, and the quality and completeness of the data are important preconditions for extracting value from it downstream. However, constrained by the methods and processes of data collection and by the behavioral patterns of the subjects being observed, data often exhibit extremely high missingness rates. Such missingness severely degrades data analysis and the management decisions it supports. Effectively completing big data in advance is therefore important for guaranteeing the quality of subsequent analysis and decision making. This paper systematically reviews the data completion problem in big data settings, presenting the main challenges, solution approaches, and implications for management research in ultra-high-dimensional, multi-source heterogeneous, and spatiotemporally correlated scenarios, with the aim of laying theoretical and methodological foundations for big data completion and data-enabled management decision making.
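The paper's completion methods are not described in the abstract. As one common illustration of completing data with high missingness, here is a minimal iterative low-rank (SVD) imputation sketch on synthetic data; the function, rank, and missingness rate are all illustrative assumptions, not the paper's approach.

```python
import numpy as np

def svd_impute(X, rank=2, n_iter=200):
    """Iterative low-rank imputation: fill NaNs with column means, then
    repeatedly replace them with values from a rank-`rank` SVD
    approximation, keeping observed entries fixed."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(mask, approx, X)
    return filled

rng = np.random.default_rng(0)
true = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8))   # rank-2 signal
X = true.copy()
X[rng.random(X.shape) < 0.4] = np.nan                        # 40% missing
rmse = np.sqrt(np.mean((svd_impute(X) - true) ** 2))
print(f"imputation RMSE: {rmse:.3f} (entry std {true.std():.3f})")
```

The low-rank assumption stands in for the cross-source and spatiotemporal correlation the abstract mentions: it is that shared structure that makes entries recoverable despite heavy missingness.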

9.
The objects of thinking-oriented data mining are the historical decision data of management processes. The topological structure on which thinking-oriented data mining relies is the mind-concept map. Thinking-oriented data mining proceeds in two stages: computing clusters of associated topics, and computing the most frequent thinking paths. Building on mind maps and concept maps, this paper constructs the mind-concept map and proposes an algorithm for thinking-oriented data mining based on it.

10.
Empowered by virtualization technology, service requests from cloud users can be honored through creating and running virtual machines. Virtual machines established for different users may be allocated to the same physical server, making the cloud vulnerable to co‐residence attacks, in which a malicious attacker steals a user's data by co‐residing a virtual machine on the same server. To protect data against such theft, the data partition technique is applied to divide the user's data into multiple blocks, each handled by a separate virtual machine. Moreover, early warning agents (EWAs) are deployed to possibly detect and prevent co‐residence attacks at a nascent stage. This article models and analyzes the attack success probability (complement of data security) in cloud systems subject to a competing attack detection process (by EWAs) and data theft process (by co‐residence attackers). Based on the suggested probabilistic model, the optimal data partition and protection policy is determined with the objective of minimizing the user's cost subject to providing a desired level of data security. Examples are presented to illustrate effects of different model parameters (attack rate, number of cloud servers, number of data blocks, attack detection time, and data theft time distribution parameters) on the attack success probability and optimization solutions.
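The competing detection and theft processes suggest a simple race model. A minimal sketch, assuming exponentially distributed detection and theft times and independent blocks; the rates and the cost-free search over block counts are illustrative simplifications, not the paper's full model.

```python
def block_success(theft_rate, detect_rate):
    """P(theft finishes before detection) for exponential competing processes."""
    return theft_rate / (theft_rate + detect_rate)

def attack_success(n_blocks, theft_rate=1.0, detect_rate=3.0):
    """All n_blocks must be stolen before detection for the attack to succeed."""
    return block_success(theft_rate, detect_rate) ** n_blocks

def min_blocks(max_success_prob, theft_rate=1.0, detect_rate=3.0):
    """Smallest number of data blocks keeping attack success below a target."""
    n = 1
    while attack_success(n, theft_rate, detect_rate) > max_success_prob:
        n += 1
    return n

print(attack_success(1), attack_success(3))   # per-block race, then 3 blocks
print(min_blocks(1e-3))                       # blocks needed for <0.1% success
```

This shows the qualitative trade-off in the abstract: partitioning into more blocks drives the attack success probability down geometrically, while (in the paper's model) raising the user's cost.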

11.
Estimation from Zero-Failure Data   (Cited by: 2; self-citations: 0, others: 2)
When performing quantitative (or probabilistic) risk assessments, it is often the case that data for many of the potential events in question are sparse or nonexistent. Some of these events may be well-represented by the binomial probability distribution. In this paper, a model for predicting the binomial failure probability, P, from data that include no failures is examined. A review of the literature indicates that the use of this model is currently limited to risk analysis of energetic initiation in the explosives testing field. The basis for the model is discussed, and the behavior of the model relative to other models developed for the same purpose is investigated. It is found that the qualitative behavior of the model is very similar to that of the other models, and for larger values of n (the number of trials), the predicted P values varied by a factor of about eight among the five models examined. Analysis revealed that the estimator is nearly identical to the median of a Bayesian posterior distribution, derived using a uniform prior. An explanation of the application of the estimator in explosives testing is provided, and comments are offered regarding the use of the estimator versus other possible techniques.
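The Bayesian connection noted in the abstract has a closed form: with a uniform Beta(1, 1) prior and n trials with zero failures, the posterior on P is Beta(1, n + 1), whose quantiles can be written directly. The function name and example n are illustrative.

```python
def zero_failure_quantile(n, q=0.5):
    """q-quantile of Beta(1, n + 1): the posterior on the failure
    probability P after n trials with zero failures, under a uniform prior.
    Inverts the Beta(1, n + 1) CDF, F(p) = 1 - (1 - p)**(n + 1)."""
    return 1.0 - (1.0 - q) ** (1.0 / (n + 1))

n = 50                                       # 50 trials, no failures observed
median = zero_failure_quantile(n)            # point estimate (posterior median)
upper95 = zero_failure_quantile(n, q=0.95)   # 95% upper credible bound
print(f"posterior median P = {median:.4f}, 95% upper bound = {upper95:.4f}")
```

As a sanity check, with n = 0 (no data at all) the posterior median is 0.5, as the uniform prior requires; as n grows, the median falls roughly as 0.69/(n + 1).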

12.
13.
刘长松. 《科学咨询》 (Scientific Consulting), 2010, (10): 78-79
Computers and their peripherals, including servers, personal computers, monitors, multifunction machines, and other office automation equipment, all produce some degree of electromagnetic radiation during operation. When a spectrum analyzer is used to measure these devices, computer techniques can be applied to process and analyze the readings so as to improve measurement accuracy. Starting from the basic principles of data fitting and drawing on the characteristics of computing techniques, this paper briefly describes how data-fitting methods can be used to optimize spectrum data analysis and improve measurement accuracy, and compares the spectrum plots before and after fitting.
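The fitting step the paper describes is ordinary least squares. A minimal sketch on synthetic "spectrum" samples; the frequencies, units, and quadratic trend are invented for illustration and do not reproduce the paper's measurements.

```python
import numpy as np

freq = np.linspace(30, 1000, 60)              # MHz sweep (hypothetical)
true = 40 - 0.02 * freq + 1e-5 * freq ** 2    # smooth underlying level (dB, invented)
rng = np.random.default_rng(0)
noisy = true + rng.normal(0, 1.5, freq.size)  # measurement noise

coeffs = np.polyfit(freq, noisy, deg=2)       # least-squares quadratic fit
fitted = np.polyval(coeffs, freq)

rms_before = np.sqrt(np.mean((noisy - true) ** 2))
rms_after = np.sqrt(np.mean((fitted - true) ** 2))
print(f"RMS error: raw {rms_before:.2f} dB, fitted {rms_after:.2f} dB")
```

Because the fit pools all 60 samples to estimate 3 coefficients, the fitted curve's error is well below the per-sample noise, which is the accuracy improvement the paper's before/after comparison illustrates.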

14.
We develop a general model for the software development process and propose a policy to manage system coordination using system fault reports (e.g., interface inconsistencies, parameter mismatches, etc.). These reports are used to determine the timing of coordination activities that remove faults. We show that under an optimal policy, coordination should be performed only if a "threshold" fault count has been exceeded. We apply the policy to software development processes and compare the management of those projects under different development conditions. A series of numerical experiments is conducted to demonstrate how the fault threshold policy needs to be adjusted to changes in system complexity, team skill, development environment, and project schedule. Moreover, we compare the optimal fault threshold policy to an optimal release‐based policy. The release‐based policy does not take fault data into account and is easier to administer. The comparisons help to define the range of project parameters for which observing fault data can provide significant benefits for managing a software project.
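A minimal simulation of the threshold idea, assuming faults arrive randomly each period and a coordination activity clears all accumulated faults; the rates, the 10-task period, and the function names are illustrative, not the paper's model.

```python
import random

def run_project(n_periods, fault_rate, threshold, seed=0):
    """Count coordination activities under a fault-threshold policy:
    coordinate (and clear all faults) only when the running fault count
    exceeds `threshold`."""
    random.seed(seed)
    faults, coordinations = 0, 0
    for _ in range(n_periods):
        # Each period, 10 development tasks each introduce a fault w.p. fault_rate.
        faults += sum(random.random() < fault_rate for _ in range(10))
        if faults > threshold:
            coordinations += 1
            faults = 0
    return coordinations

print(run_project(100, 0.3, threshold=1))    # low threshold: frequent coordination
print(run_project(100, 0.3, threshold=20))   # high threshold: rare coordination
```

Lowering the threshold buys fewer latent faults at the price of more coordination events, which is the cost trade-off the paper's optimal threshold balances.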

15.
Calculation of Benchmark Doses from Continuous Data   (Cited by: 20; self-citations: 0, others: 20)
A benchmark dose (BMD) is the dose of a substance that corresponds to a prescribed increase in the response (called the benchmark response or BMR) of a health effect. A statistical lower bound on the benchmark dose (BMDL) has been proposed as a replacement for the no-observed-adverse-effect-level (NOAEL) in setting acceptable human exposure levels. A method is developed in this paper for calculating BMDs and BMDLs from continuous data in a manner that is consistent with those calculated from quantal data. The method involves defining an abnormal response, either directly by specifying a cutoff x0 that separates continuous responses into normal and abnormal categories, or indirectly by specifying the proportion P0 of abnormal responses expected among unexposed subjects. The method does not involve actually dichotomizing individual continuous responses into quantal responses, and in certain cases can be applied to continuous data in summarized form (e.g., means and standard deviations of continuous responses among subjects in discrete dose groups). In addition to specifying the BMR and either x0 or P0, the method requires specification of the distribution of continuous responses, including specification of the dose-response θ(d) for a measure of central tendency. A method is illustrated for selecting θ(d) to make the probability of an abnormal response any desired dose-response function. This enables the same dose-response model (Weibull, log-logistic, etc.) to be used for the probability of an abnormal response, regardless of whether the underlying data are continuous or quantal. Whenever the continuous responses are normally distributed with standard deviation σ (independent of dose), the method is equivalent to defining the BMD as the dose corresponding to a prescribed change in the mean response relative to σ.
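For normally distributed responses, the abstract's final remark has a short closed form: under the extra-risk definition of the BMR, the BMD corresponds to a fixed shift of the mean in units of σ. A sketch using only the standard library; the function name, the extra-risk convention, and the example values are illustrative assumptions.

```python
from statistics import NormalDist

def bmd_mean_shift(P0, BMR):
    """Required change in the mean response, in standard-deviation units,
    for the probability of an 'abnormal' response (response above the
    cutoff x0) to rise from P0 to P0 + BMR*(1 - P0) (extra risk).
    Derivation: P0 = 1 - Phi((x0 - mu0)/sigma) fixes the cutoff in z-units;
    solving the same relation at the BMD gives the shift below."""
    z = NormalDist().inv_cdf
    P_star = P0 + BMR * (1 - P0)        # extra-risk definition of the BMR
    return z(1 - P0) - z(1 - P_star)

# With 5% abnormal at background and a 10% extra-risk benchmark:
shift = bmd_mean_shift(P0=0.05, BMR=0.10)
print(f"mean must shift by {shift:.3f} sigma")
```

Given a dose-response model θ(d) for the mean, the BMD is then the dose at which θ(d) − θ(0) equals this shift times σ.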

16.
When assessing risks posed by environmental chemical mixtures, whole mixture approaches are preferred to component approaches. When toxicological data on whole mixtures as they occur in the environment are not available, Environmental Protection Agency guidance states that toxicity data from a mixture considered "sufficiently similar" to the environmental mixture can serve as a surrogate. We propose a novel method to examine whether mixtures are sufficiently similar, when exposure data and mixture toxicity study data from at least one representative mixture are available. We define sufficient similarity using equivalence testing methodology comparing the distance between benchmark dose estimates for mixtures in both data‐rich and data‐poor cases. We construct a "similar mixtures risk indicator" (SMRI) (analogous to the hazard index) on sufficiently similar mixtures linking exposure data with mixtures toxicology data. The methods are illustrated using pyrethroid mixtures occurrence data collected in child care centers (CCC) and dose‐response data examining acute neurobehavioral effects of pyrethroid mixtures in rats. Our method shows that the mixtures from 90% of the CCCs were sufficiently similar to the dose‐response study mixture. Using exposure estimates for a hypothetical child, the 95th percentile of the (weighted) SMRI for these sufficiently similar mixtures was 0.20 (i.e., where SMRI <1, less concern; >1, more concern).

17.
Among the most important problems facing corporate planners is that of data: its availability; the management of uncertainty and risk; data presentation; etc. With the exception of several articles on forecasting methods, the literature of corporate planning contains little on the subject of data. The purpose of this article is to present some ideas which will encourage further debate.

18.
Applications of Data Mining in the Telecom Value-Added Services Industry   (Cited by: 6; self-citations: 0, others: 6)
Data mining is being applied ever more widely in enterprise marketing. It can help marketers discover meaningful relationships hidden in data, formulate effective marketing plans, and ultimately bring the enterprise greater profit. This paper applies clustering, logistic regression, decision trees, and other data mining techniques to the customer data of a telecom value-added services company, showing that only by being customer-oriented, analyzing the demand characteristics of its target market, and following cost-leadership and differentiation strategies can the company establish a firm footing in the SMS value-added market.
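A minimal sketch of two of the techniques named above, k-means clustering for customer segmentation and logistic regression for response scoring, on invented customer data; all features, labels, and parameter values are synthetic assumptions, not the company's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
# Two invented customer populations: light and heavy SMS users.
usage = np.concatenate([rng.normal(20, 5, n // 2), rng.normal(80, 10, n // 2)])
tenure = rng.uniform(1, 60, n)                 # months as a customer (invented)
X = np.column_stack([usage, tenure])
buys = (usage + rng.normal(0, 10, n) > 50).astype(int)  # synthetic purchase label

segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clf = LogisticRegression().fit(X, buys)
print("segment sizes:", np.bincount(segments))
print("training accuracy: %.2f" % clf.score(X, buys))
```

The segmentation identifies the target market (heavy users), and the regression scores each customer's purchase propensity, the two ingredients the abstract ties to a customer-oriented marketing plan.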

19.
A Bayesian forecasting model is developed to quantify uncertainty about the postflight state of a field-joint primary O-ring (not damaged or damaged), given the O-ring temperature at the time of launch of the space shuttle Challenger in 1986. The crux of this problem is the enormous extrapolation that must be performed: 23 previous shuttle flights were launched at temperatures between 53 °F and 81 °F, but the next launch is planned at 31 °F. The fundamental advantage of the Bayesian model is its theoretic structure, which remains correct over the entire sample space of the predictor and affords flexibility of implementation. A novel approach to extrapolating the input elements based on expert judgment is presented; it recognizes that extrapolation is equivalent to changing the conditioning of the model elements. The prior probability of O-ring damage can be assessed subjectively by experts following a nominal-interacting process in a group setting. The Bayesian model can output several posterior probabilities of O-ring damage, each conditional on the given temperature and on a different strength of the temperature effect hypothesis. A lower bound on, or a value of, the posterior probability can be selected for decision making consistently with expert judgment, which encapsulates engineering information, knowledge, and experience. The Bayesian forecasting model is posed as a replacement for the logistic regression and the nonparametric approach advocated in earlier analyses of the Challenger O-ring data. A comparison demonstrates the inherent deficiency of the generalized linear models for risk analyses that require (1) forecasting an event conditional on a predictor value outside the sampling interval, and (2) combining empirical evidence with expert judgment.
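The paper's forecasting model is richer than the abstract can show, but its core Bayesian step, combining a subjectively assessed prior with evidence, can be sketched as prior odds times a likelihood ratio. Both numbers in the example are invented, not values from the paper.

```python
def posterior_prob(prior, likelihood_ratio):
    """Posterior P(damage | evidence) via Bayes' rule in odds form:
    posterior odds = prior odds * likelihood ratio."""
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

# Hypothetical expert prior of 0.20 for damage at 31 F, with the evidence
# judged 9x more likely under "damage" than under "no damage":
print(round(posterior_prob(0.20, 9.0), 3))   # 0.692
```

The odds form makes explicit how expert judgment (the prior) and the strength of the temperature-effect hypothesis (the likelihood ratio) combine, which is the kind of transparency the abstract claims generalized linear models lack when extrapolating.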

20.
This paper introduces time‐varying grouped patterns of heterogeneity in linear panel data models. A distinctive feature of our approach is that group membership is left unrestricted. We estimate the parameters of the model using a “grouped fixed‐effects” estimator that minimizes a least squares criterion with respect to all possible groupings of the cross‐sectional units. Recent advances in the clustering literature allow for fast and efficient computation. We provide conditions under which our estimator is consistent as both dimensions of the panel tend to infinity, and we develop inference methods. Finally, we allow for grouped patterns of unobserved heterogeneity in the study of the link between income and democracy across countries.
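A minimal sketch of the grouped fixed-effects idea under simplifying assumptions (no covariates, two groups, a simple data-driven initialization): alternate between assigning each unit to the best-fitting group time-profile and re-estimating profiles as group means, a k-means-style least squares minimization over groupings.

```python
import numpy as np

def grouped_fixed_effects(Y, G=2, n_iter=20):
    """Estimate group memberships and group time-profiles for panel Y
    (units x periods) by alternating assignment and profile updates."""
    # Simple data-driven initialization: split on the first period.
    groups = (Y[:, 0] > np.median(Y[:, 0])).astype(int)
    for _ in range(n_iter):
        profiles = np.stack([Y[groups == g].mean(axis=0) for g in range(G)])
        errs = ((Y[:, None, :] - profiles[None]) ** 2).sum(axis=2)
        groups = errs.argmin(axis=1)          # reassign to best-fitting group
    return groups, profiles

# Simulated panel: two group time-paths plus noise (all values invented).
rng = np.random.default_rng(1)
T, n = 8, 100
true_groups = rng.integers(0, 2, n)
paths = np.array([np.linspace(0, 1, T), np.linspace(1, 0, T)])
Y = paths[true_groups] + rng.normal(0, 0.2, (n, T))

groups, _ = grouped_fixed_effects(Y)
match = max(np.mean(groups == true_groups), np.mean(groups != true_groups))
print(f"units correctly grouped (up to relabeling): {match:.2f}")
```

The paper's estimator searches over all groupings jointly with regression coefficients; this alternating scheme is the fast clustering-style computation the abstract alludes to, shown here in its simplest no-covariate form.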


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号