Web content extraction based on SVM and gravity radius model of DOM

doi:10.19678/j.issn.1000-3428.0050677

Computer Engineering, 2019, Vol. 45, Issue (6): 206-210.

YI Guohong^a,b, DAI Yu^a, FENG Zhili^a, LI Huiyu^a

a. School of Computer Science and Engineering; b. Hubei Provincial Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China

Received: 2018-03-08 Online: 2019-06-15 Published: 2019-06-15
About the authors: YI Guohong (born 1972), male, associate professor, M.S.; main research interests: data mining, software engineering, and Web semantics. DAI Yu (corresponding author), FENG Zhili, and LI Huiyu are master's students.
Funding:
Young Scientists Fund of the National Natural Science Foundation of China, "Research on the group decision-making problem of adaptive software requirements based on capability-integrated dynamic programming" (61502355).

Abstract:

To extract the main content of a Web page accurately, an algorithm based on a Support Vector Machine (SVM) and a DOM gravity radius model is proposed. An SVM first classifies the page's DOM node set to obtain the text block nodes. The gravity radius is then computed from the page's link information and the text block nodes found in this first pass, and the gravity radius model is used for a second, more precise extraction. The corresponding formula derivation and the hyperparameter selection process are also presented. Experimental results show that, compared with algorithms such as statistical extraction and FFT-based extraction, the proposed algorithm achieves higher accuracy and extraction efficiency and generalizes better.

Key words: Support Vector Machine (SVM), feature vector, gravity radius, Web pages, content extraction
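
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Block` fields, the two features, the hand-set weights `W` and `BIAS` (standing in for a trained SVM decision function), and the definition of the radius as `k` weighted standard deviations around a text-length-weighted mean position are all assumptions made for illustration. The paper's actual formulas and hyperparameters are not reproduced here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Block:
    """A DOM node reduced to a few numbers (hypothetical representation)."""
    position: int   # order of the node in a DOM traversal
    text_len: int   # characters of text under the node
    link_len: int   # characters of that text inside <a> elements


def features(b: Block) -> tuple[float, float]:
    """Assumed feature vector: scaled text length and link density."""
    density = b.link_len / b.text_len if b.text_len else 1.0
    return (min(b.text_len, 500) / 500.0, density)


# Hand-set weights standing in for a trained linear SVM (assumption).
W = (2.0, -2.0)
BIAS = -0.5


def is_text_block(b: Block) -> bool:
    """Stage 1: linear decision function w.x + b > 0."""
    x = features(b)
    return sum(w * xi for w, xi in zip(W, x)) + BIAS > 0


def gravity_filter(blocks: list[Block], k: float = 1.5) -> list[Block]:
    """Stage 2: keep candidates within an assumed 'gravity radius'.

    The center is the text-length-weighted mean position; the radius is
    k times the weighted standard deviation of the positions.
    """
    total = sum(b.text_len for b in blocks)
    if total == 0:
        return []
    center = sum(b.position * b.text_len for b in blocks) / total
    var = sum(b.text_len * (b.position - center) ** 2 for b in blocks) / total
    radius = k * sqrt(var)
    return [b for b in blocks if abs(b.position - center) <= radius]


def extract(blocks: list[Block], k: float = 1.5) -> list[Block]:
    """Run the coarse classifier, then the radius-based refinement."""
    candidates = [b for b in blocks if is_text_block(b)]
    return gravity_filter(candidates, k)


page = [
    Block(position=0, text_len=40, link_len=40),    # navigation bar
    Block(position=5, text_len=400, link_len=0),    # body paragraph
    Block(position=6, text_len=450, link_len=10),   # body paragraph
    Block(position=7, text_len=420, link_len=0),    # body paragraph
    Block(position=40, text_len=200, link_len=0),   # stray text far away
    Block(position=41, text_len=60, link_len=50),   # footer links
]
main_text = extract(page)
```

In this toy page the first stage discards the link-heavy navigation and footer blocks, and the second stage drops the stray block at position 40 because it lies far from the weighted center of the remaining candidates. The hyperparameter `k` controls how aggressive that second pass is.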

CLC number:

TP18

YI Guohong, DAI Yu, FENG Zhili, LI Huiyu. Web content extraction based on SVM and gravity radius model of DOM[J]. Computer Engineering, 2019, 45(6): 206-210.

https://www.ecice06.com/CN/Y2019/V45/I6/206
