Computer Engineering

Web Information Automatic Extraction Based on DOM Tree and Visual Feature

HUANG Wu-guan, ZHU Ming, YIN Wen-ke   

  1. (Department of Automation, University of Science and Technology of China, Hefei 230027, China)
  • Received: 2012-08-10 Online: 2013-10-15 Published: 2013-10-14

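  • About the authors: HUANG Wu-guan (b. 1987), male, master's student, main research interest: Web information extraction; ZHU Ming, professor and doctoral supervisor; YIN Wen-ke, doctoral student
  • Funding:
    National Key Technology R&D Program of China (2011BAH11B01); Key Deployment Program of the Chinese Academy of Sciences (KGZD-EW-103-(5))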

Abstract: This paper proposes an automatic Web information extraction method based on the Document Object Model (DOM) tree and visual features, aimed at extracting business listings from the list pages of life information service websites. The method first uses the DOM tree structure and the visual features of data regions in list pages to find candidate target data regions. It then uses visual features to identify the true target data region among the candidates and extracts the data records it contains. In tests on ten life information service websites, the method achieves 100% recall and 100% precision on eight of them, which indicates that it performs well on this task.
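The abstract gives only the outline of the pipeline (candidate regions from the DOM tree, target region from visual features, then record extraction), so the following is a minimal sketch under stated assumptions rather than the authors' algorithm. It assumes that a candidate region is a node with at least `min_records` children sharing the same tag structure, and that the visual feature is a rendered area supplied in a hypothetical `data-area` attribute (a real system would obtain it from a browser's layout engine). All function names are illustrative.

```python
from collections import Counter
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children, self.texts = [], []


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes well-formed markup without void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.texts.append(data.strip())


def signature(node):
    # Structural fingerprint: the node's tag plus the tags of its children.
    return (node.tag, tuple(c.tag for c in node.children))


def node_text(node):
    # Own text first, then the text of each child, joined by spaces.
    parts = list(node.texts) + [node_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def repeated_children(node):
    # Children that share the most common signature under this node.
    if not node.children:
        return []
    top, _ = Counter(signature(c) for c in node.children).most_common(1)[0]
    return [c for c in node.children if signature(c) == top]


def candidate_regions(root, min_records=3):
    # Assumption: a data region holds at least min_records similar children.
    return [n for n in walk(root) if len(repeated_children(n)) >= min_records]


def pick_target(candidates):
    # Assumption: the target region is the candidate with the largest
    # rendered area, read from the hypothetical data-area attribute.
    return max(candidates, key=lambda n: float(n.attrs.get("data-area", 0)))


def extract_records(html, min_records=3):
    builder = TreeBuilder()
    builder.feed(html)
    candidates = candidate_regions(builder.root, min_records)
    if not candidates:
        return []
    return [node_text(r) for r in repeated_children(pick_target(candidates))]
```

On a page with a small navigation list and a large listing block, both are candidate regions by structure alone; the area comparison is what separates them in this sketch.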

Key words: Document Object Model (DOM) tree, visual feature, automatic extraction, data record, data region, mining algorithm

