Similar Documents
20 similar documents found (search time: 171 ms)
1.
丁冲, 范钧, 栾添. 《统计教育》2008, (12): 8-12, 7
Image mining is an emerging branch of data mining. With the development of digital imaging technology and its wide use across disciplines, the analysis of large collections of image data has become increasingly important. Because the objects and content of image mining differ from traditional data, its methods also differ from traditional techniques. This paper introduces the basic concepts and framework of image mining together with the latest international research results. It reviews the key problems and modeling frameworks of image mining, compares the field with related areas such as pattern recognition and image processing, and on this basis surveys recent applications of image mining in satellite remote sensing, medical imaging, and biological micrograph research.

2.
Data stream mining is one of the new research directions in data mining. This paper introduces the characteristics of data streams and of data stream mining, summarizes and analyzes existing data stream mining algorithms, and outlines research directions and application prospects for the field.

3.
When studying the quantitative characteristics of a group of related populations, the relationships among those characteristics, and the interactions among the populations, drawing one cross-sectional sample from each population yields a special data type—dual cross-sectional data—that differs from cross-sectional data, time series data, and panel data. Although existing data-processing methods suggest treatments such as "panelization", "parallelization", and resistance tests of equation structure, each approach is imperfect given the particularities of dual cross-sectional data, and targeted methods that fully exploit the information hidden in such data still await further exploration.

4.
Cluster analysis of panel data and its applications (cited 19 times: 0 self-citations, 19 by others)
Unlike traditional econometric modeling, this paper explores the use of multivariate statistical methods for panel data analysis. It introduces statistical description methods for panel data, constructs statistical indicators of the similarity between panel data units, and on this basis proposes an effective method for clustering panel data; practical applications have produced good results.
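The abstract does not spell out the similarity indicators it constructs, but the general idea—measure the distance between cross-sectional units over their whole observation window, then cluster hierarchically—can be sketched as follows. The panel shape, Euclidean distance, and Ward linkage are illustrative assumptions, not the paper's actual choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical panel: 10 units observed on 2 variables over 5 periods.
rng = np.random.default_rng(0)
panel = rng.normal(size=(10, 5, 2))   # (unit, time, variable)

# One simple similarity notion: treat each unit's full trajectory as a
# vector and compute Euclidean distances between units.
flat = panel.reshape(10, -1)          # concatenate periods per unit
Z = linkage(flat, method="ward")      # hierarchical clustering of units
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)                         # cluster membership per unit
```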

5.
唐晓彬 et al. 《统计研究》2021, 38(8): 146-160
This paper innovatively combines the semi-supervised, interactive keyword-extraction algorithm TF-IDF (Term Frequency-Inverse Document Frequency) with the BERT (Bidirectional Encoder Representation from Transformers) model to design a text-mining technique that expands a set of seed keywords for CPI forecasting. An interactive TF-IDF algorithm first broadens the original CPI seed keyword vocabulary; on that basis, a two-stage BERT retrieval-and-filtering model mines the text in depth and matches keywords, deepening the expansion and thereby building a keyword library for CPI forecasting. The paper then compares forecasting models built on the keywords before and after this feature expansion. The study shows that, compared with traditional keyword-extraction algorithms, the interactive TF-IDF algorithm requires no external corpus and accepts seed words as input. Meanwhile, the BERT model is fine-tuned from a base model via transfer learning to acquire domain knowledge, achieving effective language representation, semantic expansion, and human-machine interaction for the CPI forecasting problem. Relative to traditional text-mining techniques, the proposed technique has stronger generalization and representation power: starting from 84 CPI seed keywords, the expanded keyword set yields more accurate and more interpretable CPI forecasts. The text-mining technique designed here for CPI forecasting also offers a new line of research and a reference for building keyword libraries for other macroeconomic indicators.
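The interactive TF-IDF variant described in the abstract is not specified in detail; a minimal sketch of plain TF-IDF keyword scoring—the first, breadth-expansion stage—might look as follows, with the mini-corpus and the top-5 cut-off being purely illustrative. The BERT retrieval-and-filtering stage is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of news snippets (stand-ins for CPI-related text).
docs = [
    "pork prices rose sharply this month",
    "vegetable prices fell while fuel costs rose",
    "rent and fuel costs drive consumer prices",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)               # rows: documents, cols: terms
scores = X.sum(axis=0).A1                 # aggregate TF-IDF score per term
terms = vec.get_feature_names_out()
top = sorted(zip(terms, scores), key=lambda t: -t[1])[:5]
print(top)                                # candidate keywords to add to the seed set
```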

6.
The impact of big data is not a subversion of statistics, whose object is sample data, but an extension of modern statistics. Drawing on the characteristics of big data and taking the expansion of the economic value of data as the starting point, this paper examines the relationship between data mining and big data analytics from the perspective of data value extraction, and discusses the creation of data derivatives and the necessity of establishing "data engineering" as a discipline in the big data era. On this basis, and with reference to the concept and disciplinary system of financial engineering, the paper defines the concept of "data engineering" and summarizes the theoretical foundations, main research content, and analytical techniques for building the discipline.

7.
We argue that the return distribution, as the direct outcome of investment decisions in market trading, necessarily embodies the behavioral characteristics of those decisions. In our earlier work we built a behavioral model of the return distribution and used it to gauge investors' reactions to small-probability events; effective investment strategies can be built on such a behavioral model. Building on the behavioral return-distribution model and the investment-behavior model, and by examining investors' reactions to small-probability events, this paper departs from the traditional practice of using returns to proxy investor reaction and proposes a "reversion" strategy distinct from the traditional momentum and contrarian strategies. Using Shanghai stock market data, we empirically compare the proposed reversion strategy with the traditional momentum and contrarian strategies.
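The behavioral reversion strategy itself is not described in enough detail to reproduce, but the traditional momentum and contrarian baselines it is compared against can be illustrated on toy data; all numbers below (lookback window, portfolio size, return process) are arbitrary assumptions.

```python
import numpy as np

# Toy illustration of the traditional baselines only; the paper's
# behavioral "reversion" strategy is not specified here.
rng = np.random.default_rng(1)
returns = rng.normal(0, 0.02, size=(250, 20))    # 250 days x 20 hypothetical stocks

lookback = 20
past = returns[-lookback - 1:-1].sum(axis=0)     # formation-period return per stock
winners = np.argsort(past)[-5:]                  # momentum: buy past winners
losers = np.argsort(past)[:5]                    # contrarian: buy past losers
print("momentum buys:", winners, "contrarian buys:", losers)
```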

8.
秦磊, 谢邦昌. 《统计研究》2016, 33(2): 107-110
Opportunities and challenges coexist in the era of big data; how to process big data with traditional methods deserves careful thought, and blindly chasing big data is not necessarily right. Taking Google Flu Trends (GFT) as a case study, this paper reviews the main techniques and results of big data in epidemic surveillance, discusses the key issues in its use, and offers improvements based on sophisticated statistical tools. The success of Google Flu Trends rested on exploiting correlation; its failure stemmed from issues such as model construction and the conflict between causation and correlation. The analysis of, and lessons from, the GFT case have important theoretical and practical implications for future government big data solutions.

9.
唐晓彬 et al. 《统计研究》2020, 37(7): 104-115
Macroeconomic indicators such as the consumer confidence index exhibit lagged effects over time and multidimensional dynamics, making them hard to forecast precisely. Based on the Long Short-Term Memory (LSTM) neural network, and using big data techniques to mine web search data (User Search, US) related to the consumer confidence index, this paper constructs an LSTM&US forecasting model and applies it to long-, medium-, and short-horizon forecasts of China's consumer confidence index, with several benchmark models introduced for comparison. The results show that incorporating search data improves the predictive performance and accuracy of the LSTM network; the LSTM&US model generalizes well, is stable across forecast horizons, and outperforms the six benchmark models (LSTM, SVR&US, RFR&US, BP&US, XGB&US, and LGB&US) in both performance and accuracy. The findings indicate that the proposed LSTM&US model has practical value; the approach offers a new line of research for forecasting the consumer confidence index and enriches the theory of machine-learning methods for macroeconomic forecasting.
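A minimal sketch of the LSTM&US idea—feeding lagged index values together with search-volume series into an LSTM regressor—is given below. The window length, layer width, and data are illustrative assumptions, not the paper's specification.

```python
import numpy as np
import tensorflow as tf

# Hypothetical design: predict next month's index from the previous 12 months
# of [index, search-volume] pairs. Shapes and sizes are illustrative only.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12, 2))   # (samples, time steps, features)
y = rng.normal(size=(200, 1))       # next-period index value

model = tf.keras.Sequential([
    tf.keras.Input(shape=(12, 2)),
    tf.keras.layers.LSTM(32),       # sequence encoder
    tf.keras.layers.Dense(1),       # regression output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
print(model.predict(X[:1]))
```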

10.
Tourism demand forecasting is an emerging and important research area in tourism studies that has only just begun in China. Based on a review of dozens of recent studies from abroad, this paper studies and introduces the application of single-equation models—one family of quantitative forecasting techniques—to tourism demand forecasting, aiming to provide a reference for research in related fields in China. …

11.
The support vector machine (SVM) is a popular classifier in applications such as pattern recognition, text mining, and image retrieval owing to its flexibility and interpretability. However, its performance deteriorates when the response classes are imbalanced. To enhance the performance of the SVM classifier in imbalanced cases, we investigate a new two-stage method that adaptively scales the kernel function. Based on the information obtained from the standard SVM in the first stage, we conformally rescale the kernel function in a data-adaptive fashion in the second stage, so that the separation between the two classes is effectively enlarged while accounting for the observation imbalance. The proposed method takes into account the location of the support vectors in the feature space and is therefore especially appealing when the response classes are imbalanced. The resulting algorithm efficiently improves classification accuracy, as confirmed by intensive numerical studies as well as an application to real prostate cancer imaging data.
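A rough sketch of the two-stage idea: fit a standard RBF SVM, then refit with a conformally rescaled kernel K'(x, z) = D(x)K(x, z)D(z) that magnifies the metric near the stage-one decision boundary f(x) = 0. The choice D(x) = exp(-κ f(x)²) and the constant κ are illustrative assumptions; the paper's imbalance-aware rescaling may differ.

```python
import numpy as np
from sklearn.svm import SVC

# Stage 1: standard RBF SVM on imbalanced toy data (180 vs. 20 points).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (180, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 180 + [1] * 20)
svm1 = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Stage 2: conformal factor D(x) = exp(-kappa * f(x)^2), large near the
# stage-1 boundary, so K'(x, z) = D(x) K(x, z) D(z) enlarges separation there.
kappa = 1.0
def D(A):
    return np.exp(-kappa * svm1.decision_function(A) ** 2)

def rescaled_kernel(A, B):
    K = np.exp(-1.0 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))  # RBF, gamma=1
    return D(A)[:, None] * K * D(B)[None, :]

svm2 = SVC(kernel=rescaled_kernel).fit(X, y)
print(svm2.score(X, y))
```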

12.
The author proposes a new method for flexible regression modeling of multi‐dimensional data, where the regression function is approximated by a linear combination of logistic basis functions. The method is adaptive, selecting simple or more complex models as appropriate. The number, location, and (to some extent) shape of the basis functions are automatically determined from the data. The method is also affine invariant, so accuracy of the fit is not affected by rotation or scaling of the covariates. Squared error and absolute error criteria are both available for estimation. The latter provides a robust estimator of the conditional median function. Computation is relatively fast, particularly for large data sets, so the method is well suited for data mining applications.
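A fitted linear combination of logistic basis functions is exactly a one-hidden-layer network with logistic units and a linear output, so an off-the-shelf fit can illustrate the model class; the author's adaptive selection of the number and location of bases, the affine invariance, and the absolute-error criterion are not reproduced in this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy regression surface on two covariates.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

# 10 logistic basis functions, combined linearly by the output layer.
model = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                     max_iter=5000, random_state=0).fit(X, y)
print(model.score(X, y))   # in-sample R^2 of the fitted combination
```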

13.
Real-world applications of association rule mining have the well-known problem of discovering a large number of rules, many of which are not interesting or useful for the application at hand. Algorithms for closed and maximal itemset mining significantly reduce the volume of rules discovered and the complexity of the task, but the implications of their use—and the important differences in generalization power, precision, and recall when they are used for classification—have not been examined. In this paper, we present a systematic evaluation of the association rules discovered by frequent, closed, and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed on a number of real-world datasets with diverse data/item characteristics, and rule sets are evaluated in detail both as a whole and with respect to individual classes. Empirical results confirm that with a proper combination of data mining and statistical analysis, a large number of non-significant, redundant, and contradictive rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal the important characteristics of, and differences between, frequent, closed, and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures when optimizing such rule sets. With closed itemset mining already a preferred choice for reducing complexity and redundancy during rule generation, this study further confirms that closed-itemset-based association rules are also of better quality in terms of classification precision and recall, both overall and on individual class examples. Maximal-itemset-based association rules, which are a subset of closed-itemset-based rules, prove insufficient in this regard and typically have worse recall and generalization power. Empirical results also expose the drawback of applying the confidence measure at the start of rule generation, as is typically done within the association rule framework: removing rules below a certain confidence threshold also removes the knowledge that contradictions of the relatively higher-confidence rules exist in the data, so precision can be increased by discarding contradictive rules before applying the confidence constraint.
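The three itemset families can be illustrated with a few lines of brute-force code on a toy transaction set (practical miners are far more efficient): frequent itemsets meet a support threshold, closed itemsets have no superset with equal support, and maximal itemsets have no frequent superset at all.

```python
from itertools import combinations

# Toy transactions; minimum support count of 2. Illustrative only.
T = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
items = sorted(set().union(*T))

def support(s):
    return sum(s <= t for t in T)   # number of transactions containing s

# Frequent itemsets: support >= 2 (brute-force enumeration).
frequent = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if support(set(c)) >= 2]

# Closed: no proper superset has the same support.
closed = [s for s in frequent
          if not any(s < u and support(u) == support(s) for u in frequent)]

# Maximal: no proper superset is frequent at all.
maximal = [s for s in frequent if not any(s < u for u in frequent)]

print(len(frequent), len(closed), len(maximal))   # 7 4 1: maximal ⊆ closed ⊆ frequent
```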

14.
This article describes a method for computing approximate statistics for large data sets, when exact computations may not be feasible. Such situations arise in applications such as climatology, data mining, and information retrieval (search engines). The key to our approach is a modular approximation to the cumulative distribution function (cdf) of the data. Approximate percentiles (as well as many other statistics) can be computed from this approximate cdf. This enables the reduction of a potentially overwhelming computational exercise into smaller, manageable modules. We illustrate the properties of this algorithm using a simulated data set. We also examine the approximation characteristics of the approximate percentiles, using a von Mises functional type approach. In particular, it is shown that the maximum error between the approximate cdf and the actual cdf of the data is never more than 1% (or any other preset level). We also show that under assumptions of underlying smoothness of the cdf, the approximation error is much lower in an expected sense. Finally, we derive bounds for the approximation error of the percentiles themselves. Simulation experiments show that these bounds can be quite tight in certain circumstances.
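A minimal sketch of the modular idea: each chunk of data contributes counts to a fixed evaluation grid, the counts are pooled into an approximate cdf, and percentiles are read off the grid. The grid range and chunk sizes are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Pool per-chunk counts on a fixed grid into one approximate cdf.
grid = np.linspace(-5, 5, 1001)
counts = np.zeros_like(grid)
n = 0
rng = np.random.default_rng(5)
for _ in range(100):                      # 100 "modules" (chunks)
    chunk = rng.normal(size=10_000)       # stand-in for one block of a huge data set
    counts += np.searchsorted(np.sort(chunk), grid, side="right")  # values <= grid pt
    n += chunk.size

cdf = counts / n                          # approximate cdf at the grid points
q95 = grid[np.searchsorted(cdf, 0.95)]    # approximate 95th percentile
print(q95)                                # close to 1.645 for N(0, 1)
```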

15.
For a relatively young public higher-education institution competing locally with a long-established and reputed private one, it is of great importance to become a reference institution: better known, identified with the society it belongs to, and ultimately well positioned within the European Higher Education Area. These considerations have led the university's governors to set the objective of adequately managing the institutional brand, focused on its logo and on image promotion, and to establish a university shop as a highly suitable instrument for such promotion. In this context, an on-line survey of three different kinds of members of the institution was launched, yielding a large data sample. Different kinds of variables are analysed through appropriate exploratory multivariate techniques (symmetrical methods) and regression-related techniques (non-symmetrical methods), and the conclusion advocates such a combination. The application of statistical techniques of data and text mining provides empirical insights into the institution members' perceptions and helps extract facts valuable for establishing policies to improve the corporate identity and the success of the corporate shop.

16.
A Bayesian multi-category kernel classification method is proposed. The algorithm classifies the projections of the data onto the principal axes of the feature space. The advantage of this approach is that the regression coefficients are identifiable and sparse, leading to large computational savings and improved classification performance. The degree of sparsity is regulated in a novel framework based on Bayesian decision theory. The Gibbs sampler is implemented to find the posterior distributions of the parameters, so probability distributions of the predictions can be obtained for new data points, giving a more complete picture of classification. The algorithm is aimed at high-dimensional data sets where the dimension of the measurements exceeds the number of observations. The applications considered in this paper are microarray, image processing, and near-infrared spectroscopy data.
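The projection idea can be sketched with an off-the-shelf pipeline—kernel PCA to obtain the principal axes of the feature space, then a sparse classifier on the projections. The plain L1-penalized logistic fit below stands in for the paper's Bayesian estimation and Gibbs sampler, which are not reproduced.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# p >> n setting, as in microarray data; values are illustrative.
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 500))            # 60 observations, 500 measurements
y = rng.integers(0, 3, size=60)           # three classes

# Project onto principal axes of the (RBF) kernel feature space,
# then fit a sparse classifier on the projections.
Z = KernelPCA(n_components=20, kernel="rbf").fit_transform(X)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Z, y)
print((clf.coef_ != 0).sum(), "non-zero coefficients")
```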

17.
18.
On multivariate Gaussian copulas (cited 1 time: 0 self-citations, 1 by others)
Gaussian copulas are a handy tool in many applications. However, when the dimension of the data is large, there are too many parameters to estimate, and the use of a special variance structure can facilitate the task. In many cases, especially when different data types are used, the Pearson correlation is not a suitable measure of dependence. We study the properties of the Kendall and Spearman correlation coefficients—which have better properties and are invariant under monotone transformations—used in place of Pearson coefficients. The Spearman correlation coefficient appears to be the more suitable for use in such complex applications.
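For a bivariate Gaussian copula with parameter ρ, the classical identities τ = (2/π)arcsin(ρ) and ρ_S = (6/π)arcsin(ρ/2) connect the rank correlations to the copula parameter, so a rank-based estimate of Kendall's τ can be inverted to recover ρ robustly; a quick numerical check:

```python
import numpy as np

# Classical identities for the bivariate Gaussian copula with parameter rho:
#   Kendall:  tau   = (2/pi) * arcsin(rho)
#   Spearman: rho_S = (6/pi) * arcsin(rho / 2)
rho = 0.6
tau = 2 / np.pi * np.arcsin(rho)
rho_s = 6 / np.pi * np.arcsin(rho / 2)

# Inverting the Kendall identity recovers the copula parameter from tau,
# which is invariant under monotone transformations of the margins.
rho_from_tau = np.sin(np.pi * tau / 2)
print(round(tau, 4), round(rho_s, 4), round(rho_from_tau, 4))  # last value is 0.6
```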

19.
Bayesian networks (BNs) are probabilistic expert systems that have emerged over the last few decades as a powerful data mining technique. BNs have become especially popular in biomedical applications, where they have been used for diagnosing diseases and studying complex cellular networks, among many other uses. In this study, we built a BN in a fully automated way to analyse data on injuries due to the inhalation, ingestion, and aspiration of foreign bodies (FBs) in children; a sensitivity analysis was then carried out to characterize the uncertainty associated with the model. While other studies focused on characteristics such as the shape, consistency, and dimensions of the FBs that caused injuries, we propose an integrated environment that makes the relationships among the factors underlying the problem clear. The advantage of this approach is that it gives a picture of the influence of critical factors on injury severity and allows the effects of different FB characteristics (volume, type, shape, and consistency) and children's features (age and gender) on the risk of hospitalization to be compared. The rates it allows one to calculate provide a more rational basis for educating care-givers about the most influential risk factors for adverse outcomes.

20.
Clustering algorithms are important methods widely used in mining data streams because of their ability to deal with infinite data flows. Although these algorithms perform well at mining latent relationships in data streams, most of them suffer a loss of cluster purity and become unstable when the input data streams contain too many noisy variables. In this article, we propose an algorithm for clustering data streams with noisy variables by adding a variable-selection step as a component of the clustering algorithm. Simulation results show that our proposed method improves on previous approaches: two experiments indicate that clustering data streams with variable selection is more stable and achieves better purity than clustering without it, and a further experiment on the KDD-CUP99 dataset shows that our algorithm generates more stable results.
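A sketch of the combination described—screen out noisy variables first, then cluster the stream incrementally—using an ad hoc dispersion criterion and mini-batch k-means; the paper's actual selection procedure and clustering algorithm are not specified here, so everything below is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(7)

def stream_batches(n_batches=50, size=200):
    # 3 informative variables (tight clusters) plus 7 noisy ones.
    for _ in range(n_batches):
        signal = rng.normal(rng.integers(0, 3), 0.3, size=(size, 3))
        noise = rng.uniform(-5, 5, size=(size, 7))
        yield np.hstack([signal, noise])

batches = stream_batches()
first = next(batches)
keep = np.argsort(first.std(axis=0))[:3]   # keep the 3 least-dispersed variables

km = MiniBatchKMeans(n_clusters=3, random_state=0)
km.partial_fit(first[:, keep])
for b in batches:
    km.partial_fit(b[:, keep])             # incremental updates on the stream
print(km.cluster_centers_.shape)
```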
