Web News Extraction Method Based on Topic Weight of Wildcard Node

doi:10.19678/j.issn.1000-3428.0050057

Computer Engineering, 2019, Vol. 45, Issue 4: 275-280.

ZHANG Kaihang^1,2, XU Kefu^3, ZHANG Chuang^1

1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China

Received: 2018-01-10 Online: 2019-04-15 Published: 2019-04-15
About the authors: ZHANG Kaihang (born 1993), male, master's student, whose main research interests are information retrieval and public opinion computing; XU Kefu, research professor, Ph.D.; ZHANG Chuang, senior engineer, Ph.D.
Funding:
National Natural Science Foundation of China (61602474).

Abstract

Most existing methods for automatically extracting Web news content ignore the topic features of the text, so noise text whose styling and layout resemble the body text is easily misidentified as news content. To address this, the paper proposes an extraction method based on the topic weight of wildcard nodes. After an HTML document is parsed into a DOM tree, the DOM tree is matched against its corresponding wildcard tree and the topic weight of each wildcard node is computed; the text nodes covered by wildcard nodes with high-weight topics are then identified as content nodes. Experimental results show that, compared with traditional news extraction methods, the proposed method reduces the false recognition rate for noise text at the edges of Web news content and extracts news content with higher accuracy.

Key words: content extraction, wildcard node, maximal compatibility class, Otsu algorithm, topic generation
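
The abstract does not say how "high-weight" wildcard nodes are separated from low-weight ones. The keywords list the Otsu algorithm, so one plausible reading is that a threshold over the node weights is chosen by maximizing between-class variance. The sketch below illustrates only that thresholding step, under that assumption; the function names, node identifiers, and weight values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def otsu_threshold(weights: List[float]) -> float:
    """Pick the cut that maximizes between-class variance (Otsu's criterion).

    Works directly on the sorted weights instead of a histogram, which is
    adequate for the small number of nodes on a single page.
    If all weights are equal, the single value itself is returned.
    """
    values = sorted(weights)
    n = len(values)
    total = sum(values)
    best_score, best_cut = -1.0, values[0]
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[k - 1]
        if values[k] == values[k - 1]:
            continue  # no cut can fall between two equal values
        mean_low = left_sum / k
        mean_high = (total - left_sum) / (n - k)
        # Between-class variance, up to the constant factor 1 / n**2.
        score = k * (n - k) * (mean_low - mean_high) ** 2
        if score > best_score:
            best_score = score
            best_cut = (values[k - 1] + values[k]) / 2
    return best_cut


def select_content_nodes(node_weights: Dict[str, float]) -> List[str]:
    """Return the ids of nodes whose weight lies above the Otsu cut."""
    cut = otsu_threshold(list(node_weights.values()))
    return [node for node, w in node_weights.items() if w > cut]


# Hypothetical topic weights for six wildcard nodes of one page.
example_weights = {
    "nav": 0.05, "ad": 0.08, "footer": 0.10,
    "p1": 0.80, "p2": 0.90, "p3": 0.85,
}
content_nodes = select_content_nodes(example_weights)
```

With these illustrative weights the cut falls between the boilerplate-like nodes and the paragraph-like nodes, so only `p1`, `p2`, and `p3` are kept. How the topic weights themselves are computed is not specified in the abstract and is left out here.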

CLC number: TP391.1

ZHANG Kaihang, XU Kefu, ZHANG Chuang. Web News Extraction Method Based on Topic Weight of Wildcard Node[J]. Computer Engineering, 2019, 45(4): 275-280.

http://www.ecice06.com/CN/Y2019/V45/I4/275
