首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 265 毫秒
1.
提出基于最近邻插补和关联规则的缺失数据插补方法,将不含缺失数据的变量作为辅助变量,通过定义距离函数寻找与含缺失数据的样本单元距离较近的样本,然后利用挖掘得到的关联规则支持度和提升度乘积的倒数作为权重,对样本单元之间的距离进行加权处理,得到加权距离,再用加权距离最小的样本单元对应的属性值对缺失值进行插补。这种方法可以解决由不同最近距离样本单元得到不同插补值的问题,最后给出了该方法的实施步骤和应用范例。  相似文献   

2.
基于统计与聚类的信用评级新方法   总被引:1,自引:0,他引:1  
文章针对国内信用评级研究面临的违约率数据缺失的问题,提出了一种综合回归思想与聚类算法的方法.通过得到的模拟数据进行回归分析,得到了关于回归与聚类一致性和最优输入参数两个回归方程;以回归方程来指导聚类算法,并利用我国2012年226家上市公司的财务数据对方法的有效性进行了检验.  相似文献   

3.
薛薇 《统计研究》2002,19(4):52-53
一、概述数据挖掘是 90年代中后期兴起的一门跨学科的综合研究领域 ,它集计算机机器学习、统计学、数据库管理、数据仓库、可视化、并行计算、决策支持为一体 ,利用数据库、数据仓库技术存储和管理数据 ,利用机器学习和统计学方法分析数据 ,旨在发现大量复杂数据中蕴含的有价值的知识和信息。目前 ,随着数据挖掘应用的不断开展以及客观现实对数据分析需求的不断增长 ,人们越来越认识到数据挖掘的重要性和必要性。数据挖掘通过对数据的总结、分类、聚类、关联等分析 ,实现对数据内在结构特征的理解和对未知数据的预测。其中 ,数据总结是在数…  相似文献   

4.
多指标面板数据能够较全面的提供研究对象的信息和数据特征,但复杂的数据结构也给其聚类分析带来了一定的困难.针对这一问题,文章提出了基于特征提取的多指标面板数据聚类方法,该方法将能够表征面板数据动态变化的“绝对量”特征、“波动”特征、“偏度”特征、“峰度”特征及“趋势”特征引入动态聚类算法中,可以避免以往采用欧式距离进行聚类的局限性,还可以处理带有缺失数据的面板数据,同时大大提高了聚类效率,并最大限度地保证时间维度信息不受损失.利用该方法分析了2001至2013年我国不同省份道路交通事故的不平衡状况,通过实证分析表明该方法能够解决多指标面板数据聚类的问题.  相似文献   

5.
以内蒙古自治区12个盟市的绿色资源环境发展为研究对象,采用灰色动态聚类与粗糙集相结合的方法,构建包含有全年供水量等11个指标的内蒙古自治区绿色资源环境指标体系,其要点在于:一是通过灰色关联分析建立样本间的灰色关联矩阵,进而进行样本间的灰色聚类,反映样本间的信息重复性;二是采用动态聚类方法,每次去除一个指标后继续通过灰色关联分析建立的灰色关联矩阵进行灰色样本聚类,为粗糙集约简提供信息数据;三是通过粗糙集约简理论判断海选指标对聚类结果的影响是否显著,将每一次的聚类结果与原始聚类结果比较,保留两次聚类结果不同且对评价样本分类有显著影响的海选指标;四是采用非参数Kruska-Wallis检验的P值检验法证明本文构建的指标体系是合理的。通过对比分析表明,本文的灰色动态聚类-粗糙集指标筛选模型优于现有研究的聚类-灰色关联指标筛选模型。  相似文献   

6.
基于统计模型的模糊聚类算法的时间复杂度在数据集规模超过一定数量级时是计算不可行的,解决时间复杂度的一个行之有效的方法是抽样.文章通过对静态抽样进行改进,设计了一种半静态抽样法,使样本数据集最大程度得保持原数据集的信息,并保证聚类结果的不失真性;最后通过实证分析,比较并证明了该方法是有效的.  相似文献   

7.
对由多个指标组成的多元数据进行聚类分析时,数据维度的增加、各指标与总体聚类的相关性程度不一致以及各指标服从的分布不同会增加聚类的复杂性,影响聚类结果的准确性,因此需要通过合适的方法来对多元数据进行聚类分析。针对这一问题,提出改进的带粘性的层次Dirichlet过程(sticky Hierarchical Dirichlet Process)方法来实现对多元数据的降维聚类,以解决各指标服从不同分布的问题,并用粘性参数反映各指标与总体聚类之间的相关性。用MCMC方法来估计模型参数。通过对仿真模拟数据和IRIS数据集的聚类分析,证实了该方法的有效性,同时发现单个指标与总体聚类的相关性越大,则相应的粘性参数越大,从而反映该指标在总体聚类中的重要性程度越高;并且当各指标数据中有粘性较大的指标时,带粘性的层次Dirichlet过程方法明显优于其他聚类方法,能够显著提高分类的准确性。  相似文献   

8.
针对不平衡数据集中的少数类样本在实际应用中分类准确率较低的问题,提出一种利用多数类样本的自然最近邻进行欠采样的数据处理方法。自然最近邻算法根据每个样本的分布特征动态地为样本选择数量不同的自然最近邻样本,通过自然最近邻的个数反映样本分布的疏密程度。文章所提方法先计算多数类样本在整体数据集中的自然最近邻,根据自然最近邻情况移除多数类中的噪声样本和局部密度较小的样本,再计算剩余样本的相似度,保留密集区域中的代表性样本,去掉部分冗余样本,获得平衡数据集。该方法的计算无须预先指定参数,减少了欠采样过程中多数类分类信息的损失。对比实验利用支持向量机对不同欠采样方法平衡后的12个数据集进行分类,结果表明此方法在大多数数据集上具有较优的分类性能,提升了少数类样本的分类准确率。  相似文献   

9.
聚类有效性指标是评价一种聚类方法划分质量和确定最佳聚类数目的重要工具.文章提出了一种新的聚类有效性指标——T指标,该有效性指标利用最小生成树思想计算类内内聚度,在计算的过程中不再与聚类中心发生直接联系.经反复实验证明新的有效性指标对各种形状分布的划分均有良好的评价表现,且能正确确定各种重叠度数据集的聚类数目.  相似文献   

10.
基于数据分布密度划分的聚类算法是数据挖掘聚类算法中的主要方法之一。针对传统密度划分聚类算法存在运算复杂、运行效率不高等缺陷,设计出高维分步投影的多重分区聚类算法;以高维分布投影密度为依据,对数据集进行多重分区产生数据集的子簇空间,并进行子簇合并形成了理想的聚类结果;依据算法进行实验,结果证明该算法具有运算简单和运行效率高等优良性。  相似文献   

11.
Real world applications of association rule mining have well-known problems of discovering a large number of rules, many of which are not interesting or useful for the application at hand. The algorithms for closed and maximal itemsets mining significantly reduce the volume of rules discovered and complexity associated with the task, but the implications of their use and important differences with respect to the generalization power, precision and recall when used in the classification problem have not been examined. In this paper, we present a systematic evaluation of the association rules discovered from frequent, closed and maximal itemset mining algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate sequence of usage. The experiments are performed using a number of real-world datasets that represent diverse characteristics of data/items, and detailed evaluation of rule sets is provided as a whole and w.r.t individual classes. Empirical results confirm that with a proper combination of data mining and statistical analysis, a large number of non-significant, redundant and contradictive rules can be eliminated while preserving relatively high precision and recall. More importantly, the results reveal the important characteristics and differences between using frequent, closed and maximal itemsets for the classification task, and the effect of incorporating statistical/heuristic measures for optimizing such rule sets. With closed itemset mining already being a preferred choice for complexity and redundancy reduction during rule generation, this study has further confirmed that overall closed itemset based association rules are also of better quality in terms of classification precision and recall, and precision and recall on individual class examples. On the other hand maximal itemset based association rules, that are a subset of closed itemset based rules, show to be insufficient in this regard, and typically will have worse recall and generalization power. Empirical results also show the downfall of using the confidence measure at the start to generate association rules, as typically done within the association rule framework. Removing rules that occur below a certain confidence threshold, will also remove the knowledge of existence of any contradictions in the data to the relatively higher confidence rules, and thus precision can be increased by disregarding contradictive rules prior to application of confidence constraint.  相似文献   

12.
In this paper, we investigate the relative merits of rain rules for one-day cricket matches. We suggest that interrupted one-day matches present a missing data problem: the outcome of the complete match cannot be observed, and instead the outcome of the interrupted match, as determined at least in part by the rain rule in question, is observed. Viewing the outcome of the interrupted match as an imputation of the missing outcome of the complete match, standard characteristics to assess the performance of classification tests can be used to assess the performance of a rain rule. In particular, we consider the overall and conditional accuracy and the predictive value of a rain rule. We propose two requirements for a ‘fair’ rain rule, and show that a fair rain rule must satisfy an identity involving its conditional accuracies. Estimating the performance characteristics of various rain rules from a sample of complete one-day matches our results suggest that the Duckworth–Lewis method, currently adopted by the International Cricket Council, is essentially as accurate as and somewhat more fair than its best competitors. A rain rule based on the iso-probability principle also performs well but might benefit from re-calibration using a more representative data base.  相似文献   

13.
DNA microarrays allow for measuring expression levels of a large number of genes between different experimental conditions and/or samples. Association rule mining (ARM) methods are helpful in finding associational relationships between genes. However, classical association rule mining (CARM) algorithms extract only a subset of the associations that exist among different binary states; therefore can only infer part of the relationships on gene regulations. To resolve this problem, we developed an extended association rule mining (EARM) strategy along with a new way of the association rule definition. Compared with the CARM method, our new approach extracted more frequent genesets from a public microarray data set. The EARM method discovered some biologically interesting association rules that were not detected by CARM. Therefore, EARM provides an effective tool for exploring relationships among genes.  相似文献   

14.
Technological advances in genotyping have given rise to hypothesis-based association studies of increasing scope. As a result, the scientific hypotheses addressed by these studies have become more complex and more difficult to address using existing analytic methodologies. Obstacles to analysis include inference in the face of multiple comparisons, complications arising from correlations among the SNPs (single nucleotide polymorphisms), choice of their genetic parametrization and missing data. In this paper we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations, MISA, allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level, with the prior distribution on SNP inclusion in the model providing an intrinsic multiplicity correction. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally "validated" in independent studies. We examine sensitivity of the NCOCS results to prior choice and method for imputing missing data. MISA is available in an R package on CRAN.  相似文献   

15.
基于关联规则挖掘的股票板块指数联动分析   总被引:7,自引:0,他引:7  
本文应用数据挖掘中的关联规则算法,对我国股票市场中板块指数的关系进行实证分析。通过采用关联规则的Apriori算法技术,可以从大量数据中挖掘出我国股票市场的板块指数之间的强关联规则,并对其进行可视化描述与评价。文中所得的关联规则可以帮助市场参与者发现股票板块轮动的规则模式,并在此基础上规避证券市场风险。  相似文献   

16.
A multinomial classification rule is proposed based on a prior-valued smoothing for the state probabilities. Asymptotically, the proposed rule has an error rate that converges uniformly and strongly to that of the Bayes rule. For a fixed sample size the prior-valued smoothing is effective in obtaining reason¬able classifications to the situations such as missing data. Empirically, the proposed rule is compared favorably with other commonly used multinomial classification rules via Monte Carlo sampling experiments  相似文献   

17.
We display pseudo-likelihood as a special case of a general estimation technique based on proper scoring rules. Such a rule supplies an unbiased estimating equation for any statistical model, and this can be extended to allow for missing data. When the scoring rule has a simple local structure, as in many spatial models, the need to compute problematic normalising constants is avoided. We illustrate the approach through an analysis of data on disease in bell pepper plants.  相似文献   

18.
There is a growing demand for public use data while at the same time there are increasing concerns about the privacy of personal information. One proposed method for accomplishing both goals is to release data sets that do not contain real values but yield the same inferences as the actual data. The idea is to view confidential data as missing and use multiple imputation techniques to create synthetic data sets. In this article, we compare techniques for creating synthetic data sets in simple scenarios with a binary variable.  相似文献   

19.
Multiple imputation (MI) is an increasingly popular method for analysing incomplete multivariate data sets. One of the most crucial assumptions of this method relates to mechanism leading to missing data. Distinctness is typically assumed, which indicates a complete independence of mechanisms underlying missingness and data generation. In addition, missing at random or missing completely at random is assumed, which explicitly states under which conditions missingness is independent of observed data. Despite common use of MI under these assumptions, plausibility and sensitivity to these fundamental assumptions have not been well-investigated. In this work, we investigate the impact of non-distinctness and non-ignorability. In particular, non-ignorability is due to unobservable cluster-specific effects (e.g. random-effects). Through a comprehensive simulation study, we show that MI inferences suggest that nonignoriability due to non-distinctness do not immediately imply dismal performance while non-ignorability due to missing not at random leads to quite subpar performance.  相似文献   

20.
k-POD: A Method for k-Means Clustering of Missing Data   总被引:1,自引:0,他引:1  
The k-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, are common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our k-POD method presents a simple extension of k-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.

[Received November 2014. Revised August 2015.]  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号