
Computer Engineering ›› 2020, Vol. 46 ›› Issue (9): 274-282. doi: 10.19678/j.issn.1000-3428.0055967

• Development Research and Engineering Application •

Focused Crawler Strategy Based on Multi-Objective Ant Colony Algorithm

DONG Yi1, LIU Jingfa2a,2b, LIU Wenjie1

  1. School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China;
    2a. Guangzhou Key Laboratory of Multilingual Intelligent Processing; 2b. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
  • Received: 2019-09-10 Revised: 2019-10-25 Published: 2019-11-11
  • About the authors: DONG Yi (born 1994), male, master's student, main research interests: Web crawlers and intelligent computing; LIU Jingfa (corresponding author), professor, Ph.D., doctoral supervisor; LIU Wenjie, associate professor, Ph.D.
  • Funding:
    National Social Science Fund of China (16ZDA047); Natural Science Foundation of Jiangsu Province (BK20181409, BK20171458); Guangzhou Science and Technology Program (202002030238); Special Project of the Guangzhou Key Laboratory of Multilingual Intelligent Processing (201905010008).

Abstract: Traditional search engines based on keyword matching achieve low crawl recall and precision, while focused crawlers based on semantic retrieval tend to drift away from the topic and become trapped in local optima. To address this problem, this paper proposes a focused crawler method based on a multi-objective ant colony optimization algorithm. The method constructs a domain ontology and a topic vector for the focused crawler. Whether a link is relevant to the topic is then judged by three indicators: the relevance of the link's anchor text, the topic relevance of the Web page containing the link, and the topic relevance of the Web page the link points to. A multi-objective optimization model of link topic relevance is established on these indicators. The ant colony algorithm based on multi-objective optimization is introduced into the link selection process of the focused crawler, and non-dominated sorting together with the Nearest and Farthest Candidate Solution (NFCS) method is used to select Pareto-optimal links, which guides the search direction of the focused crawler and improves its global search performance. Experimental results show that, compared with FCSA, WSE and other traditional focused crawler methods, the proposed method achieves higher crawl precision and retrieves Web pages with high topic relevance more quickly.

Key words: focused crawler, ant colony algorithm, multi-objective optimization, rainstorm disaster, ontology construction

CLC Number:
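
The abstract describes selecting Pareto-optimal links by comparing three relevance indicators per link. The following is a minimal Python sketch of that non-dominated filtering step only, not the authors' implementation: it omits the ant colony pheromone update and the NFCS selection, and the `Link` type, its field names, the example URLs, and the scores are all hypothetical. It assumes each indicator has already been computed as a number where larger means more relevant.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    """Hypothetical candidate link with three precomputed relevance scores."""
    url: str
    anchor_rel: float  # relevance of the link's anchor text to the topic
    source_rel: float  # topic relevance of the page containing the link
    target_rel: float  # topic relevance of the page the link points to


def dominates(a: Link, b: Link) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    pairs = list(zip(a[1:], b[1:]))  # skip the url field, compare the three scores
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(links: List[Link]) -> List[Link]:
    """Keep the links that no other candidate dominates (input order preserved)."""
    return [c for c in links if not any(dominates(o, c) for o in links)]


# Illustrative scores only; they are not taken from the paper.
links = [
    Link("http://example.com/a", 0.9, 0.4, 0.7),
    Link("http://example.com/b", 0.5, 0.8, 0.6),
    Link("http://example.com/c", 0.4, 0.3, 0.5),
    Link("http://example.com/d", 0.9, 0.4, 0.6),
]
front = pareto_front(links)
print([link.url for link in front])
```

In this example, links `a` and `b` each win on a different indicator, so neither dominates the other and both stay on the front, while `c` and `d` are dominated by `a`. The pairwise comparison is quadratic in the number of candidates, which is acceptable for a sketch but may need a faster sorting scheme for a large crawl frontier.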