Content Extraction Method Combining Web Page Structure and Text Feature

doi:10.3969/j.issn.1000-3428.2013.12.043

Computer Engineering (计算机工程)

XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man

(College of Computer Science, Chongqing University, Chongqing 400044, China)

Received: 2012-11-21 Online: 2013-12-15 Published: 2013-12-13
About the authors: XIONG Zhong-yang (b. 1962), male, professor, main research interests: data mining, grid technology, and parallel computing; LIN Xian-qiang, master's student; ZHANG Yu-fang, professor; YA Man, master's student
Funding:
National Natural Science Foundation of China (71102065)

Abstract

Abstract: Web pages contain both main content and information unrelated to it, and the unrelated information negatively affects the classification, storage, and retrieval of Web pages. To reduce this effect, this paper proposes a content extraction method for theme-related Web pages that combines the structural features and the text features of a page. Regular expressions first remove irrelevant elements from the page, which completes an initial filtering pass. The page is then segmented linearly into blocks according to its structural features, and each block is classified as a link block or a text block according to its text features. The consecutive occurrence of noise blocks is then used to locate the main content, so that only the main text of the page is retained and the noisy parts are discarded. Experimental results show that the method extracts the main content of Web pages quickly and accurately.

Key words: content extraction, Web page denoising, Web page segmentation, topical crawling, information retrieval, Web mining
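
The pipeline outlined in the abstract (regular-expression filtering, linear blocking, link/text block classification, and locating the content between runs of noise blocks) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the paper's actual rules or parameters, so the tag lists, the thresholds `max_link_ratio`, `min_chars`, and `max_gap`, and all function names below are assumptions made for this example.

```python
import re
from html import unescape

# Elements whose whole content is treated as irrelevant (assumed list).
_DROP = re.compile(
    r"<!--.*?-->|<(script|style|noscript|iframe|form)\b.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Block-level tags used as split points for linear blocking (assumed list).
_SPLIT = re.compile(
    r"</?(?:div|p|table|tr|td|ul|ol|li|h[1-6]|br|section|article)\b[^>]*>",
    re.IGNORECASE,
)
_LINK = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def _plain(fragment):
    """Strip any remaining tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(_TAG.sub("", fragment)).split())


def split_blocks(html):
    """Filter the page, then cut it into a flat list of (text, link_text_len)."""
    cleaned = _DROP.sub("", html)
    blocks = []
    for fragment in _SPLIT.split(cleaned):
        text = _plain(fragment)
        if text:
            link_len = sum(len(_plain(m)) for m in _LINK.findall(fragment))
            blocks.append((text, link_len))
    return blocks


def is_text_block(text, link_len, max_link_ratio=0.3, min_chars=20):
    """Assumed rule: a text block is long enough and mostly not anchor text."""
    return len(text) >= min_chars and link_len / len(text) <= max_link_ratio


def _score(region):
    """Total number of characters in a candidate content region."""
    return sum(len(text) for text in region)


def extract_content(html, max_gap=2):
    """Return the candidate region of text blocks with the most characters.

    Assumed rule: a region ends once more than `max_gap` noise blocks
    (link blocks or very short blocks) occur in a row.
    """
    best, current, gap = [], [], 0
    for text, link_len in split_blocks(html):
        if is_text_block(text, link_len):
            current.append(text)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                if _score(current) > _score(best):
                    best = current
                current = []
    if _score(current) > _score(best):
        best = current
    return "\n".join(best)
```

Calling `extract_content(html_string)` returns the selected text blocks joined by newlines. In this sketch, short blocks such as subheadings inside the article are treated as noise and omitted, which a production extractor would likely want to handle differently.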

CLC number: TP18

XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man. Content Extraction Method Combining Web Page Structure and Text Feature[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2013.12.043.

https://www.ecice06.com/CN/Y2013/V39/I12/200

Editor: SUO Shu-zhi
