Computer Engineering ›› 2019, Vol. 45 ›› Issue (4): 275-280. doi: 10.19678/j.issn.1000-3428.0050057

• Development Research and Engineering Application •

Web News Extraction Method Based on Topic Weight of Wildcard Node

ZHANG Kaihang1,2, XU Kefu3, ZHANG Chuang1

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
  • Received: 2018-01-10 Online: 2019-04-15 Published: 2019-04-15
  • About the authors: ZHANG Kaihang (born 1993), male, master's student, whose main research interests are information retrieval and public opinion computing; XU Kefu, research fellow, Ph.D.; ZHANG Chuang, senior engineer, Ph.D.
  • Supported by:

    National Natural Science Foundation of China (61602474).

Abstract:

Most existing methods for automatically extracting Web news content do not take the topic features of the text into account, so noise text whose style and layout resemble the main body is easily misidentified as news content. To address this, this paper proposes an extraction method based on the topic weight of wildcard nodes. After an HTML document is parsed into a DOM tree, the method matches the DOM tree to its corresponding wildcard tree, computes the topic weight of each wildcard node, and identifies the text nodes covered by high-weight wildcard nodes as content nodes. Experimental results show that, compared with traditional news extraction methods, the proposed method reduces the false recognition rate of noise text at the edges of Web news content and extracts news content with higher accuracy.

Key words: content extraction, wildcard node, maximal compatibility class, Otsu algorithm, topic generation
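
The abstract does not say how "high-weight" wildcard nodes are separated from the rest, but the keywords list the Otsu algorithm. The sketch below is one possible reading, not the authors' implementation: it assumes each wildcard node already has a scalar topic weight and uses a one-dimensional Otsu split (maximizing between-class variance) to choose the cut. The function names `otsu_threshold` and `select_content_nodes`, the node labels, and the sample weights are all hypothetical.

```python
from typing import List, Sequence, Tuple


def otsu_threshold(weights: Sequence[float]) -> float:
    """Return the cut point that maximizes between-class variance.

    Assumption: weights are plain floats, one per wildcard node.
    If all weights are equal, the single shared value is returned.
    """
    values = sorted(weights)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weights to split")
    total = sum(values)
    best_cut, best_score = values[0], -1.0
    low_sum = 0.0
    for i in range(1, n):  # i is the size of the low-weight class
        low_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # no valid cut between two equal values
        w0, w1 = i / n, (n - i) / n
        mu0 = low_sum / i
        mu1 = (total - low_sum) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if score > best_score:
            best_score = score
            best_cut = (values[i - 1] + values[i]) / 2
    return best_cut


def select_content_nodes(node_weights: List[Tuple[str, float]]) -> List[str]:
    """Keep the nodes whose weight lies strictly above the Otsu cut."""
    cut = otsu_threshold([w for _, w in node_weights])
    return [name for name, w in node_weights if w > cut]


# Hypothetical topic weights for the wildcard nodes of one page.
sample = [
    ("nav", 0.02), ("title", 0.81), ("p1", 0.90),
    ("p2", 0.85), ("footer", 0.05), ("related", 0.10),
]
print(select_content_nodes(sample))
```

With these illustrative weights the cut falls between the navigation-like nodes and the body-like nodes, so only the latter are kept. How the paper actually computes the weights, and whether it thresholds them this way, would need to be checked against the full text.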

CLC Number: