News Information Extraction for Web Resource

doi:10.3969/j.issn.1000-3428.2006.10.027

Computer Engineering, 2006, Vol. 32, Issue 10: 74-76.

ZHU Yongsheng1, WU Gangshan2

1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093; 2. Department of Computer Science & Technology, Nanjing University, Nanjing 210093

Online: 2006-05-20 Published: 2006-05-20

Abstract

Abstract: With the widespread use of the Internet and the development of information technology, a tremendous volume of news information resources has accumulated. Extracting useful resources from this mass of news information is a problem in urgent need of a solution. Based on an analysis of the structure of news Web pages, this paper combines the strengths of two processing techniques, DOM-based structural extraction and extraction based on text feature patterns, and proposes a semi-automatic extraction technique for Web news pages that automatically downloads useful Web pages and extracts the required news information. Finally, the paper describes an information extraction system for Olympic news and presents its experimental results.

Key words: Information extraction; Wrapper; DOM; Extraction rule
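
The abstract names two techniques that the paper combines: DOM-based structural extraction and extraction based on text feature patterns. This page does not give the paper's actual rules, so the following Python sketch only illustrates the general idea. Everything in it is an assumption made for illustration: the class and function names, the choice of `<div>`/`<td>` as container elements, the "longest text block is the body" rule, and the date pattern. It is not the authors' wrapper.

```python
import re
from html.parser import HTMLParser

# Hypothetical text-feature pattern: dates such as 2006-05-20 or 2006年5月20日.
DATE_PATTERN = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})")


class NewsParser(HTMLParser):
    """Collects the <title> text and the text under each <div>/<td> container."""

    CONTAINERS = {"div", "td"}  # assumed container elements

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []        # finished container texts, in closing order
        self._stack = []        # text buffers for currently open containers
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.CONTAINERS:
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.CONTAINERS and self._stack:
            self.blocks.append(" ".join(self._stack.pop()))

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._stack:
            # Attribute text only to the innermost open container.
            self._stack[-1].append(text)


def extract_news(html):
    """Return a dict with the title, first date, and main text of a page."""
    parser = NewsParser()
    parser.feed(html)
    parser.close()
    # Structural rule (assumed): the container with the most text is the body.
    body = max(parser.blocks, key=len, default="")
    # Text-feature rule (assumed): the first date-like string in any container.
    match = DATE_PATTERN.search(" ".join(parser.blocks))
    date = "-".join(f"{int(g):02d}" for g in match.groups()) if match else None
    return {"title": parser.title, "date": date, "body": body}


# Made-up page used only to exercise the sketch.
SAMPLE = """<html><head><title>Sample headline</title></head><body>
<div class="nav">Home | Sports</div>
<div class="meta">Published 2006-05-20</div>
<div class="content"><p>First paragraph of the story.</p><p>Second paragraph of the story.</p></div>
</body></html>"""

result = extract_news(SAMPLE)
print(result)
```

In a semi-automatic setting, a person would presumably supply or adjust site-specific rules such as the container set and the date pattern, while downloading and extraction run unattended. The sketch hard-codes both for brevity.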

ZHU Yongsheng, WU Gangshan. News Information Extraction for Web Resource[J]. Computer Engineering, 2006, 32(10): 74-76.

https://www.ecice06.com/CN/Y2006/V32/I10/74
