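Research on XML Web Page Classification Based on Combined Structure and Content Extraction
Cite this article:YAN Hong-can, LI Min-qiang, REN Yun-li, YAN Shao-hong. Research on XML Web Page Classification Based on Combined Structure and Content Extraction[J]. Journal of Tianjin University (Social Sciences), 2009, 11(3): 272-276.
Authors:YAN Hong-can  LI Min-qiang  REN Yun-li  YAN Shao-hong
Affiliations:1. School of Management, Tianjin University, Tianjin 300072; College of Sciences, Hebei Polytechnic University, Tangshan 063009
2. School of Management, Tianjin University, Tianjin 300072
3. Department of Mathematics and Physics, Hebei Normal University of Science and Technology, Qinhuangdao 066004
4. College of Sciences, Hebei Polytechnic University, Tangshan 063009
Funding:Specialized Research Fund for the Doctoral Program of Higher Education
Abstract:Focusing on the characteristics of XML Web pages, this paper studies methods for extracting structure and content features from XML documents. It proposes a combined feature extraction strategy based on the frequent structure hierarchy space model and gives formulas for computing the structure feature weight and the position and frequency weights of keyword occurrences. The XML Web page feature matrix is extracted from the computed weights, and classification tests are run for three cases: structure only, content only, and combined structure and content. Using the ROSETTA system, a text classification system is built on the attribute reduction capability of rough sets to classify XML documents. Experiments show that the method achieves relatively high classification accuracy at relatively low computational cost.

Keywords:XML Web page classification  frequent structure hierarchy space model  combined feature extraction  rough set  Web page feature matrix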

Study XML Pages Classification Based on Combined Structure and Content Extraction
YAN Hong-can, LI Min-qiang, REN Yun-li, YAN Shao-hong. Study XML Pages Classification Based on Combined Structure and Content Extraction[J]. Journal of Tianjin University (Social Sciences), 2009, 11(3): 272-276.
Authors:YAN Hong-can  LI Min-qiang  REN Yun-li  YAN Shao-hong
Institution:1. School of Management, Tianjin University, Tianjin 300072, China; 2. College of Sciences, Hebei Polytechnic University, Tangshan 063009, China; 3. Department of Mathematics and Physics, Hebei Normal University of Science and Technology, Qinhuangdao 066004, China
Abstract:Based on the characteristics of XML Web pages, we studied methods for extracting structure and content features from XML documents, proposed an efficient strategy for extracting combined features based on the frequent structure hierarchy space model, and provided methods for calculating the structure feature weight and the position and frequency weights of keywords, from which the Web page feature matrix is obtained. Three classification cases (structure only, content only, and combined structure and content) were tested separately with the ROSETTA system, using the attribute reduction of rough sets to construct a text categorization system. The experiments show that the method achieves high classification accuracy at low computational cost.
Keywords:XML page classification  frequent structure hierarchy space model  combined feature extraction  rough set  Web page feature matrix
This article is indexed in VIP, Wanfang Data, and other databases.