Computer Engineering ›› 2019, Vol. 45 ›› Issue (6): 206-210. doi: 10.19678/j.issn.1000-3428.0050677

• Artificial Intelligence and Recognition Technology •

Web Content Extraction Based on SVM and the Gravity Radius Model of DOM

YI Guohong (a,b), DAI Yu (a), FENG Zhili (a), LI Huiyuan (a)

  1. a. School of Computer Science and Engineering; b. Hubei Provincial Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China
  • Received: 2018-03-08  Online: 2019-06-15  Published: 2019-06-15
  • About the authors: YI Guohong (b. 1972), male, associate professor, M.S.; his research interests are data mining, software engineering, and the semantic Web. DAI Yu (corresponding author), FENG Zhili, and LI Huiyuan are master's students.
  • Supported by:
    National Natural Science Foundation of China, Young Scientists Program, "Research on Group Decision-Making for Adaptive Software Requirements Based on Capability-Integrated Dynamic Programming" (61502355).

Abstract:

To extract the main content of a Web page accurately, this paper proposes an algorithm based on a Support Vector Machine (SVM) and a gravity radius model of the DOM. First, an SVM classifies the page's DOM node set to obtain the text-block nodes. Next, the gravity radius is calculated from the page's link information and the text-block nodes found in this first pass, and the gravity radius model is applied for a second, more precise extraction. The corresponding formula derivation and hyperparameter selection process are also presented. Experimental results show that, compared with algorithms such as statistical extraction and FFT-based extraction, the proposed algorithm achieves higher accuracy and extraction efficiency and generalizes better.

Key words: Support Vector Machine (SVM), feature vector, gravity radius, Web pages, content extraction
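
The two-stage idea in the abstract (a classifier pass followed by a gravity radius pass) can be illustrated with a minimal sketch. The abstract gives neither the feature vector nor the gravity radius formula, so everything concrete here is an assumption made for illustration: the `Node` fields, the linear decision function standing in for a trained SVM (with hypothetical weights), the gravity center taken as the mean node position weighted by non-link text length, and the radius taken as `k` times the weighted standard deviation. It is not the paper's model.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Node:
    position: int  # index of the DOM node in document order (assumed feature)
    text_len: int  # number of text characters in the node
    link_len: int  # number of those characters that sit inside links


def decision_score(node, w_text=0.01, w_link=-2.0, bias=-0.5):
    """Linear decision function w.x + b; a stand-in for a trained SVM.

    The weights are hypothetical, not learned and not taken from the paper.
    """
    link_density = node.link_len / node.text_len if node.text_len else 1.0
    return w_text * node.text_len + w_link * link_density + bias


def first_pass(nodes):
    """Stage 1: keep nodes that the classifier labels as text blocks."""
    return [n for n in nodes if decision_score(n) > 0]


def second_pass(blocks, k=1.5):
    """Stage 2: keep blocks within an assumed 'gravity radius' of the center."""
    weights = [max(n.text_len - n.link_len, 0) for n in blocks]
    total = sum(weights)
    if total == 0:
        return []
    # Assumed gravity center: mean position weighted by non-link text length.
    center = sum(w * n.position for w, n in zip(weights, blocks)) / total
    # Assumed radius: k times the weighted standard deviation of positions.
    variance = sum(w * (n.position - center) ** 2
                   for w, n in zip(weights, blocks)) / total
    radius = k * sqrt(variance)
    return [n for n in blocks if abs(n.position - center) <= radius]


# Made-up page: a navigation bar, three article paragraphs, a footer note.
sample_page = [
    Node(position=0, text_len=40, link_len=38),
    Node(position=10, text_len=400, link_len=10),
    Node(position=11, text_len=600, link_len=0),
    Node(position=12, text_len=500, link_len=20),
    Node(position=60, text_len=120, link_len=0),
]
content = second_pass(first_pass(sample_page))
print([n.position for n in content])  # [10, 11, 12]
```

In this made-up page the first pass rejects the link-heavy navigation bar but still accepts the footer note, and the second pass drops the footer because it lies far from the weighted center of the text. That is the motivation for a second pass in general; the paper's actual formulas and hyperparameters are given in the full text, not here.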

CLC Number: