作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程

• 先进计算与数据处理 • 上一篇    下一篇

基于MapReduce的并行PageRank算法实现

平 宇1,向 阳1,张 波2,黄寅飞3   

  1. (1. 同济大学计算机科学与技术系,上海 201804;2. 上海师范大学信息与机电工程学院,上海 200234;3. 上海证券交易所,上海 200120)
  • 收稿日期:2013-04-08 出版日期:2014-02-15 发布日期:2014-02-13
  • 作者简介:平 宇(1988-),男,硕士研究生,主研方向:数据库技术,数据挖掘;向 阳,教授、博士生导师;张 波,副教授;黄寅飞,高级工程师
  • 基金资助:
    国家自然科学基金资助项目(61103069, 71170148);国家科技支撑计划基金资助项目(2012BAD35B01);上海市科技创新计划基金资助项目(11DZ1501703);陈家镇智慧社区和智能交通基金资助项目(11dz1210600)

Implementation of Parallel PageRank Algoirthm Based on MapReduce

PING Yu 1, XIANG Yang 1, ZHANG Bo 2, HUANG Yin-fei 3   

  1. (1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; 2. College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China;3. Shanghai Stock Exchange, Shanghai 200120, China)
  • Received:2013-04-08 Online:2014-02-15 Published:2014-02-13

摘要: 分布式网络爬虫的广泛应用使得搜索引擎的数据规模呈几何式增长,面对数以TB甚至PB量级的数据,单机模式下的PageRank算法由于CPU、I/O和内存的开销过大导致效率低下。为此,提出一种基于MapReduce框架的并行PageRank算法。在算法的一次迭代过程中,利用Map函数对网页拓扑信息文件进行解析,使用Reduce函数计算网页得分,从而并行化PageRank算法的中间迭代过程。通过计算全局网页得分控制迭代次数,得到较精确的网页排序结果。实验结果表明,该算法在保持原有单机PageRank算法整体网页排序精度的基础上,具有较好的集群性能和较快的执行速度。

关键词: 搜索引擎, PageRank算法, MapReduce框架, 并行计算, Hadoop平台

Abstract: The emergence of distributed Web crawl largely expands the scale of related Web information. Since PageRank needs to process the topology of entire existed page set, the limitation of CPU, I/O and memory becomes the big issue when it confronts the data in TB or PB level. Aiming at these problems, this paper proposes a parallel PageRank algorithm based on MapReduce. In a certain iteration of algorithm, it processes the files containing the topology of Web page graph by Map function and calculates the pages’ scores by Reduce function. Using the global Web page score as convergence to control iterations and get more precise Web page sorting result. Experimental result shows that the improved algorithm has better clustering performance and faster execution speed on the basis of keeping the overall Web page sorting accuracy of single machine PageRank algorithm.

Key words: search engine, PageRank algorithm, MapReduce framework, parallel computing, Hadoop platform

中图分类号: