Computer Engineering ›› 2009, Vol. 35 ›› Issue (21): 51-53. doi: 10.3969/j.issn.1000-3428.2009.21.017

• Software Technology and Database •

Web Data Semi-automatic Extraction Based on XML

JIANG Hong-chao, WANG Da-liang, BAN Xiao-juan, RUAN Jin-xi

  1. (School of Information Engineering, University of Science and Technology Beijing, Beijing 100083)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-11-05 Published:2009-11-05

Abstract: Accurately obtaining, and tracking over the long term, the content that users care about on an Internet holding a huge volume of information is an important aspect of data extraction and mining. This paper discusses Web data extraction theory and its application technologies, presents a semi-automatic extraction model, and designs an extraction system based on tourism industry data to verify the feasibility of semi-automatic data extraction.

Key words: data extraction, information extraction, semi-structured data

CLC Number: