
Computer Engineering, 2018, Vol. 44, Issue (11): 289-299. doi: 10.19678/j.issn.1000-3428.0048511

• Development Research and Engineering Application •

General Vertical Crawler Method Combined with Supervised Breadth-First Search Strategy

GAO Feng a, LIU Zhen a,b, GAO Hui a,b

  1. a. School of Computer Science and Engineering; b. Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China
  • Received: 2017-09-01  Online: 2018-11-15  Published: 2018-11-15
  • About the authors: GAO Feng (born 1992), male, M.S. candidate; his main research interests are database technology and data mining. LIU Zhen, associate professor, Ph.D. GAO Hui, professor.
  • Supported by:

    National Natural Science Foundation of China (61300018)

Abstract:

Vertical crawler programs generally cannot be transplanted directly to other websites, and their design requires extensive manual intervention. To address this, a highly portable, general-purpose vertical crawler design method is proposed. Target topics and directory page URLs are identified automatically, and URL regular expression filters are generated by URL clustering, which removes the need to maintain the initial URL queue by hand. Using these regular expression filters, parsing path templates, and a supervised breadth-first search strategy with Web page weighting, the method locates relevant pages precisely and extracts data quickly and accurately. Experimental results show that the method achieves efficient, fast, and general data crawling across different websites.
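
The abstract describes two mechanisms: URL clustering that yields regular-expression filters for the crawl frontier, and a breadth-first search whose frontier is ordered by page weights. The Python sketch below is only a minimal illustration of those two ideas, not the authors' implementation; the helper callables fetch, extract_links, and page_weight, as well as the clustering heuristic (grouping URLs by host and path shape), are assumptions introduced here for illustration. In the paper's setting, the seed URLs, the filters, and the weighting would come from the supervised identification of target topics and directory pages described above.

import re
import heapq
from collections import defaultdict
from urllib.parse import urlparse


def build_url_regex_filters(sample_urls):
    """Cluster sample URLs by host and path shape, then generalize each
    cluster into a regular-expression filter (numeric segments become
    digit patterns). Illustrative heuristic only."""
    groups = defaultdict(list)
    for url in sample_urls:
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        # Cluster key: host plus a coarse "shape" of the path
        shape = tuple("<num>" if s.isdigit() else s for s in segments)
        groups[(parts.netloc, shape)].append(url)

    filters = []
    for (host, shape), _urls in groups.items():
        path_pattern = "/".join(
            r"\d+" if seg == "<num>" else re.escape(seg) for seg in shape
        )
        pattern = "https?://" + re.escape(host) + "/" + path_pattern + r"/?$"
        filters.append(re.compile(pattern))
    return filters


def weighted_bfs(seed_urls, url_filters, fetch, extract_links, page_weight,
                 max_pages=100):
    """Breadth-first crawl whose frontier is ordered by page weight, so that
    higher-weight (more topic-relevant) URLs are expanded first."""
    frontier = [(-1.0, url) for url in seed_urls]   # negate weight -> max-heap
    heapq.heapify(frontier)
    visited, results = set(), []

    while frontier and len(results) < max_pages:
        _neg_w, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                  # caller-supplied downloader
        results.append((url, page))
        for link in extract_links(page):   # caller-supplied link extractor
            if link not in visited and any(f.match(link) for f in url_filters):
                heapq.heappush(frontier, (-page_weight(link, page), link))
    return results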

Key words: vertical crawler, URL clustering, weighted Web page, path template parsing, supervised breadth-first search strategy

CLC number: