Computer Engineering ›› 2008, Vol. 34 ›› Issue (20): 55-57. doi: 10.3969/j.issn.1000-3428.2008.20.020

• Software Technology and Database •

Web Information Extraction Based on Template Flow Configuration

LIU Hui, CHEN Jing-yu, XU Xue-zhou   

  1. (Software Engineering Institute, Xidian University, Xi’an 710071)
  • Online: 2008-10-20 Published: 2008-10-20

Abstract: To address problems in Web information extraction such as the complexity of wrapper construction and extraction precision, a Web information extraction framework based on template flow configuration is proposed and implemented. The actions a user performs when requesting, accessing, and obtaining Web pages are decomposed, the action patterns among them are extracted, and these patterns are mapped to nodes in a flow configuration template. A flow interpreter parses the flow configuration XML description document created by the user and extracts the information of interest. Experimental results indicate that the system performs extraction quickly and accurately.

Key words: Web information extraction, template flow configuration, wrapper, framework
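
The abstract does not give the schema of the flow configuration template, so the following is only a minimal sketch of the general idea: user actions become nodes in an XML document, and an interpreter walks those nodes to extract fields. The node names (`request`, `extract`), their attributes, the example URL, the in-memory `PAGES` table that stands in for real Web access, and the use of regular expressions for extraction are all assumptions made for illustration, not the authors' design.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical flow configuration: each child of <flow> is one action node.
# Node and attribute names are illustrative, not taken from the paper.
FLOW_XML = """
<flow name="book-list">
  <request url="http://example.com/books"/>
  <extract field="title" pattern="&lt;h2&gt;(.*?)&lt;/h2&gt;"/>
  <extract field="price" pattern="class=&quot;price&quot;&gt;(.*?)&lt;"/>
</flow>
"""

# Stand-in for the Web: maps a URL to page content so the sketch runs offline.
PAGES = {
    "http://example.com/books": (
        '<h2>Book A</h2><span class="price">10.00</span>'
        '<h2>Book B</h2><span class="price">12.50</span>'
    )
}


def run_flow(flow_xml, fetch):
    """Interpret the flow nodes in document order and return extracted fields."""
    root = ET.fromstring(flow_xml)
    page = ""      # content of the most recently requested page
    result = {}    # field name -> list of extracted strings
    for node in root:
        if node.tag == "request":
            # "Request/access" action: load the page named by the node.
            page = fetch(node.get("url"))
        elif node.tag == "extract":
            # "Obtain" action: apply the node's pattern to the current page.
            result[node.get("field")] = re.findall(node.get("pattern"), page)
        else:
            raise ValueError(f"unknown node type: {node.tag}")
    return result


records = run_flow(FLOW_XML, PAGES.__getitem__)
print(records)
```

Passing the fetch function as a parameter keeps the interpreter independent of how pages are obtained, so the offline table used here could be swapped for a real HTTP client without touching the node-handling logic.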
