Research on Web Crawler Technology Based on Hadoop (基于Hadoop的网络爬虫技术研究)
Citation: WANG Yan-hong, ZHOU Jun. Research on Web Crawler Technology Based on Hadoop[J]. Journal of Jilin Teachers Institute of Engineering and Technology (Natural Sciences Edition) (吉林工程技术师范学院学报), 2014, 30(8): 87-89
Authors: WANG Yan-hong (王艳红), ZHOU Jun (周军)
Affiliation: Management Information Department, Nantong Shipping College (南通航运职业技术学院), Nantong, Jiangsu 226010, China
Abstract: A web crawler typically starts from a seed page, reads that page's content and the links it contains, and repeats the process on each linked page until every page reachable from the seed has been found. When the volume of data to crawl is large, traditional techniques have certain drawbacks, whereas the open-source Hadoop cloud computing framework offers some advantages for data collection. After introducing the Hadoop cloud computing framework, this paper explains the principle of the web crawler and implements a web crawler based on Hadoop.

Keywords: Hadoop; web crawler; MapReduce; search engine
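
The crawl loop summarized in the abstract (start from one page, read its links, repeat until no new pages appear) can be illustrated with a small sketch. This is not the authors' implementation and does not use the Hadoop API; it is plain Python over a hypothetical in-memory link graph (`TOY_WEB`, `toy_fetch_links`, and all function names are assumptions made for illustration). The second half expresses the same crawl as repeated map and reduce rounds, which is roughly the shape a MapReduce-based crawler might take.

```python
from collections import deque

# Hypothetical in-memory "web": page -> links found on that page.
# A real crawler would fetch each page over HTTP and parse out its links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def toy_fetch_links(page):
    """Stand-in for 'download the page and extract its links'."""
    return TOY_WEB.get(page, [])


def crawl(seed, fetch_links):
    """Breadth-first crawl; returns pages in the order they were discovered."""
    discovered = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in fetch_links(page):
            if link not in seen:  # skip pages already discovered
                seen.add(link)
                discovered.append(link)
                queue.append(link)
    return discovered


def map_phase(frontier, fetch_links):
    """Map step: emit a (link, source_page) pair for every link on every frontier page."""
    return [(link, page) for page in frontier for link in fetch_links(page)]


def reduce_phase(pairs, seen):
    """Reduce step: collapse duplicate links and drop those already crawled."""
    return sorted({link for link, _source in pairs} - seen)


def crawl_in_rounds(seed, fetch_links):
    """Same crawl, organized as one map/reduce round per link depth."""
    seen = {seed}
    frontier = [seed]
    rounds = [frontier]
    while frontier:
        frontier = reduce_phase(map_phase(frontier, fetch_links), seen)
        seen.update(frontier)
        if frontier:
            rounds.append(frontier)
    return rounds


if __name__ == "__main__":
    print(crawl("a", toy_fetch_links))
    print(crawl_in_rounds("a", toy_fetch_links))
```

In the round-based version, each frontier page can be processed independently during the map step, which is the property that would let a framework such as Hadoop spread the fetching work across many machines; the reduce step is where duplicate links are merged before the next round.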
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.