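A Method for Detecting Approximately Duplicate Records in a Data Warehouse
Cite this article: LI Xing-yi, BAO Cong-jian, SHI Hua-ji. A Method for Detecting Approximately Duplicate Records in a Data Warehouse [J]. Journal of University of Electronic Science and Technology of China (Social Sciences Edition), 2007(6).
Authors: LI Xing-yi, BAO Cong-jian, SHI Hua-ji
Affiliations: School of Electronic and Information Engineering, Beijing Jiaotong University, Haidian District, Beijing 10004 (LI Xing-yi); School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013 (LI Xing-yi, BAO Cong-jian, SHI Hua-ji)
Funding: National Torch Program project (2004EB33006[0]); Jiangsu Provincial Universities Natural Science Guiding Program project (05JKD520050)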
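Abstract: To address the detection and elimination of approximately duplicate records in a data warehouse, this paper proposes a detection method for such records. The method first computes a weight for each field using the rank-based method. Then, following a grouping approach, it selects a key field, or certain character positions of a field, to partition the large data set into many disjoint small data sets. Finally, it detects and eliminates approximately duplicate records within each small data set; to avoid missed detections, the detection is repeated several times using other key fields or other character positions of a field. Theoretical analysis and experiments show that the method achieves both good detection accuracy and good time efficiency, and can effectively handle approximately duplicate record detection on large volumes of data.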

Keywords: approximately duplicate records; data warehouse; grouping; rank-based method; data weighting
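The abstract does not say which rank-based formula the authors use to weight fields. As a minimal sketch, the block below assumes the common rank-sum scheme, in which a field of rank r among n fields (1 = most important) gets the weight (n - r + 1) / (n(n + 1)/2). The field names, the ranking, and the use of `difflib.SequenceMatcher` as the per-field similarity are all illustrative assumptions, not details taken from the paper.

```python
from difflib import SequenceMatcher


def rank_sum_weights(ranks):
    """Turn {field: rank} (1 = most important) into weights that sum to 1.

    Assumed rank-sum formula: w_i = (n - r_i + 1) / sum_j (n - r_j + 1).
    """
    n = len(ranks)
    total = sum(n - r + 1 for r in ranks.values())
    return {field: (n - r + 1) / total for field, r in ranks.items()}


def record_similarity(a, b, weights):
    """Weighted sum of per-field string similarities; result lies in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, a[field], b[field]).ratio()
        for field, w in weights.items()
    )


# Hypothetical ranking: name matters most, phone least.
weights = rank_sum_weights({"name": 1, "city": 2, "phone": 3})

# Two hypothetical records that differ only by a typo in the name.
rec_a = {"name": "Zhang Wei", "city": "Zhenjiang", "phone": "0511-8878001"}
rec_b = {"name": "Zhang Wie", "city": "Zhenjiang", "phone": "0511-8878001"}
score = record_similarity(rec_a, rec_b, weights)
```

With three fields the assumed scheme gives the top-ranked field a weight of 0.5, so a typo in a highly ranked field lowers the record score more than the same typo in a low-ranked one.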

A Method for Detecting Approximately Duplicate Database Records in Data Warehouse
LI Xing-yi, BAO Cong-jian, SHI Hua-ji. A Method for Detecting Approximately Duplicate Database Records in Data Warehouse [J]. Journal of University of Electronic Science and Technology of China (Social Sciences Edition), 2007(6).
Authors: LI Xing-yi, BAO Cong-jian, SHI Hua-ji
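Institution: LI Xing-yi (1, 2), BAO Cong-jian (2), SHI Hua-ji (2); 1. School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing; 2. School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013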
Abstract: Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data mining and data quality improvement. This paper presents an algorithm for detecting approximately duplicate database records based on ranking and grouping. First, each field of the data is assigned a weight using the rank-based weighting method. Second, following a grouping approach, the large data set is divided into many disjoint small data sets. Finally, approximately duplicate records are detected and eliminated within each small data set. To avoid missed detections, these steps can be repeated. Theoretical analysis and experiments show that the algorithm achieves good detection precision and good time efficiency, and is therefore an effective approach to detecting approximately duplicate records in massive data.
Keywords: approximately duplicate records; data warehouse; grouping; rank-based method; weighted data
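The grouping and repeated-pass steps described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the grouping keys (first letter of the name, last four characters of the phone number), the fixed field weights, the 0.85 threshold, and the `difflib`-based similarity are all hypothetical choices. The sketch only reports duplicate pairs; how duplicates are then merged or eliminated is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b, weights):
    """Weighted sum of per-field string similarities (assumed measure)."""
    return sum(
        w * SequenceMatcher(None, a[f], b[f]).ratio() for f, w in weights.items()
    )


def detect_duplicates(records, key_funcs, weights, threshold=0.85):
    """Return index pairs (i, j), i < j, judged to be approximate duplicates.

    Each function in key_funcs drives one pass: records are split into
    disjoint groups by key, and only records in the same group are compared.
    """
    found = set()
    for key in key_funcs:
        groups = defaultdict(list)
        for idx, rec in enumerate(records):
            groups[key(rec)].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if (i, j) in found:
                    continue  # already detected in an earlier pass
                if similarity(records[i], records[j], weights) >= threshold:
                    found.add((i, j))
    return found


# Hypothetical data: record 2 has a typo in the first letter of the name,
# so grouping by that letter alone would never compare it with 0 or 1.
records = [
    {"name": "Zhang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Zhang Wie", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Shang Wei", "city": "Zhenjiang", "phone": "0511-8878001"},
    {"name": "Li Na", "city": "Beijing", "phone": "010-5168002"},
]
weights = {"name": 0.5, "city": 0.3, "phone": 0.2}  # assumed field weights

by_initial = lambda r: r["name"][:1]      # pass 1: first letter of the name
by_phone_tail = lambda r: r["phone"][-4:]  # pass 2: last four phone characters

one_pass = detect_duplicates(records, [by_initial], weights)
two_pass = detect_duplicates(records, [by_initial, by_phone_tail], weights)
```

In this toy data the single pass finds only the pair (0, 1), while the second pass, keyed on a different field, also pairs record 2 with records 0 and 1. Because comparisons happen only inside groups, the number of pairwise comparisons per pass depends on group sizes rather than on the size of the whole data set.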
This article is indexed in CNKI and other databases.