Abstract: With the spread of the Internet and the development of information technology, a large volume of news information resources has accumulated. Extracting useful resources from this mass of news information is a pressing problem. Based on an analysis of the structure of news Web pages, this paper combines the advantages of two processing techniques, DOM-based structure extraction and extraction based on text feature patterns, and proposes a semi-automatic extraction technique for Web news pages that automatically downloads useful Web pages and extracts the required news information. Finally, the paper describes an information extraction system for Olympic news and presents its experimental results.
Key words:
Information extraction; Wrapper; DOM; Extraction rule
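To illustrate how DOM-based structure extraction and text feature patterns can complement each other, the following is a minimal sketch and not the system described in the paper. It assumes a hypothetical page layout: the rule table `DOM_RULES`, the class names `news-title` and `news-body`, the date format in `DATE_PATTERN`, and the `SAMPLE` page are all invented for illustration. Page downloading is omitted.

```python
import re
from html.parser import HTMLParser

# Hypothetical extraction rules: field name -> (tag, class attribute value).
DOM_RULES = {"title": ("h1", "news-title"), "body": ("div", "news-body")}
# Hypothetical text feature pattern: a date written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
# Elements that never have a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NewsWrapper(HTMLParser):
    """Collects the text under each element matched by a DOM rule."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in DOM_RULES}
        self.all_text = []    # every text node, used for pattern matching
        self._stack = []      # currently open tags
        self._active = None   # (field name, stack depth of matched element)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append(tag)
        cls = dict(attrs).get("class")
        for name, (rule_tag, rule_cls) in DOM_RULES.items():
            if self._active is None and tag == rule_tag and cls == rule_cls:
                self._active = (name, len(self._stack))

    def handle_endtag(self, tag):
        if tag not in self._stack:
            return
        # Pop up to and including the matching open tag (tolerates bad nesting).
        while self._stack.pop() != tag:
            pass
        if self._active and len(self._stack) < self._active[1]:
            self._active = None  # the matched element has been closed

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self._active:
            self.fields[self._active[0]].append(text)


def extract_news(html):
    """Apply the DOM rules, then the text pattern, to one page."""
    wrapper = NewsWrapper()
    wrapper.feed(html)
    wrapper.close()
    record = {name: " ".join(parts) for name, parts in wrapper.fields.items()}
    match = DATE_PATTERN.search(" ".join(wrapper.all_text))
    record["date"] = match.group(0) if match else None
    return record


SAMPLE = """
<html><body>
<div class="nav">Home | Sports</div>
<h1 class="news-title">Olympic torch relay route announced</h1>
<span class="meta">Published 2006-05-20 by the sports desk</span>
<div class="news-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_news(SAMPLE))
```

In this sketch the structural rules locate fields that sit in stable positions of the page template, while the text pattern recovers a field (here the date) whose position varies but whose surface form is regular. A semi-automatic workflow would have a person supply or correct the rule table per site.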
ZHU Yongsheng (朱永盛), WU Gangshan (武港山). News Information Extraction for Web Resource (基于 Web 的新闻信息抽取)[J]. Computer Engineering (计算机工程), 2006, 32(10): 74-76.