Automatic Web News Extraction with Semantic Features

doi:10.3969/j.issn.1000-3428.2010.07.059

Computer Engineering, 2010, Vol. 36, Issue (7): 173-175. doi: 10.3969/j.issn.1000-3428.2010.07.059

• Artificial Intelligence and Recognition Technology •

SHI Yang, ZHANG Qi, HUANG Xuan-jing

(School of Computer Science, Fudan University, Shanghai 200433)

Received:1900-01-01 Revised:1900-01-01 Online:2010-04-05 Published:2010-04-05

Abstract

Abstract: This paper analyzes the semantic features of Web news pages and the similarity among such pages, and presents an automatic Web news extraction method that uses semantic features. The method uses a semantic classifier to find seed information in a news page, then uses portion (local) features of the page to build information extraction rules. When semantic features are added to the classifier, the F1 value of Web news extraction reaches 94.2%. When the semantic classifier is combined with the portion-feature-based extraction method, the F1 value reaches 96.9%. Experimental results show that the method effectively improves the accuracy of Web information extraction and reduces the cost of manual labeling.

Key words: Web information extraction, semantic features, portion features
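The abstract reports its results as F1 values. For readers unfamiliar with the metric, the sketch below computes F1 as the harmonic mean of precision and recall. The function name `f1_score` and the counts in the usage line are hypothetical and chosen only for illustration; they are not taken from the paper, which states only the final F1 figures.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined or both are zero.
    """
    predicted = true_positives + false_positives   # items the extractor returned
    actual = true_positives + false_negatives      # items that should be returned
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the paper).
print(round(f1_score(90, 10, 10), 3))
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why it is a common single-number summary for extraction tasks.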

CLC Number: TP393

SHI Yang; ZHANG Qi; HUANG Xuan-jing. Automatic Web News Extraction with Semantic Features[J]. Computer Engineering, 2010, 36(7): 173-175.

URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2010.07.059

http://www.ecice06.com/EN/Y2010/V36/I7/173
