
Computer Engineering ›› 2011, Vol. 37 ›› Issue (6): 65-67. doi: 10.3969/j.issn.1000-3428.2011.06.023

• Software Technology and Database •

Auto Generation Technology for Flight Information Extraction Rules

ZHANG Zhi-Yuan 1,2, XU Tao 1,2, FENG Xia 1,2

  (1. School of Computer Science & Technology, Civil Aviation University of China, Tianjin 300300, China; 2. Information Technology Research Base of CAAC, Tianjin 300300, China)
  • Online: 2011-03-20 Published: 2011-03-29
  • About the authors: ZHANG Zhi-Yuan (born 1978), male, lecturer, main research interest: data mining; XU Tao and FENG Xia, professors
  • Supported by:
    Key Project of the National "863" Program of China (2006AA12A106); Research Fund of Civil Aviation University of China (07kym04)

Abstract: Extraction rules play an important role in wrapper-based Web information extraction. Because Web pages are redesigned frequently, the rules must be updated continually, and writing them by hand is time-consuming and laborious. This paper therefore proposes a method for generating extraction rules automatically. The method scans the HTML source to build a TABLE tree annotated with semantic information, uses this tree to identify the data tables in a page, and then generates extraction rules with a greedy algorithm. Experimental results show that the method achieves high precision and F-score, and a high rule generation rate for the tables it identifies.

Key words: Web information extraction, extraction rules, semantic TABLE tree, greedy algorithm
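
The abstract describes scanning HTML source into a tree of TABLE elements and using it to locate the data table. The sketch below illustrates only that general idea and is not the authors' implementation: the class names `TableNode` and `TableTreeBuilder`, the scoring rule in `data_table_score` (favoring tables with many rows of equal width), and the sample page with its flight rows are all assumptions made for illustration. The paper's semantic annotations and its greedy rule generation are not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser


class TableNode:
    """One <table> element; children are the tables nested inside it."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.rows = []          # each row is a list of cell texts
        self.open_cell = False  # True while inside a <td>/<th> of this table


class TableTreeBuilder(HTMLParser):
    """Builds a tree of nested tables while scanning the HTML source."""

    def __init__(self):
        super().__init__()
        self.root = TableNode()  # virtual root that holds top-level tables
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            node = TableNode(parent=self.current)
            self.current.children.append(node)
            self.current = node
        elif self.current is self.root:
            return  # ignore markup outside any table
        elif tag == "tr":
            self.current.rows.append([])
        elif tag in ("td", "th"):
            if not self.current.rows:  # tolerate a cell without a <tr>
                self.current.rows.append([])
            self.current.rows[-1].append("")
            self.current.open_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.current.open_cell = False
        elif tag == "table" and self.current is not self.root:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.open_cell:
            self.current.rows[-1][-1] += text


def iter_tables(node):
    """Yield every table in the tree, depth first."""
    for child in node.children:
        yield child
        yield from iter_tables(child)


def data_table_score(node):
    """Assumed heuristic: many rows sharing one column count look like data."""
    widths = [len(row) for row in node.rows if row]
    if len(widths) < 2:
        return 0
    width, count = Counter(widths).most_common(1)[0]
    return count * width if width >= 2 else 0


def find_data_table(html):
    """Return the best-scoring table node, or None if nothing looks like data."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_tables(builder.root), key=data_table_score, default=None)
    if best is None or data_table_score(best) == 0:
        return None
    return best


# Hypothetical page: a layout table wrapping a nested flight table.
SAMPLE = """
<table><tr><td>Menu</td><td>
  <table>
    <tr><th>Flight</th><th>From</th><th>To</th></tr>
    <tr><td>CA1501</td><td>PEK</td><td>SHA</td></tr>
    <tr><td>MU5102</td><td>PEK</td><td>SHA</td></tr>
  </table>
</td></tr></table>
"""

if __name__ == "__main__":
    table = find_data_table(SAMPLE)
    for row in table.rows:
        print(row)
```

In this sketch the outer layout table has a single row and therefore scores zero, so the nested table is selected. A real page would likely need richer cues (header keywords, cell content types) than column-count regularity alone.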

CLC Number: