Computer Engineering ›› 2009, Vol. 35 ›› Issue (14): 32-34. doi: 10.3969/j.issn.1000-3428.2009.14.012

• Software Technology and Database •

Intelligence Conversion of HTML Table into XML

JIA Chang-yun1, CHENG Yong-shang2   

  (1. School of Computer Engineering, Huaihai Institute of Technology, Lianyungang 222069; 2. College of Computer and Information Engineering, Hohai University, Nanjing 210000)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-07-20 Published:2009-07-20

Abstract: HTML tables are widely used on the Web, while XML has become a standard format for processing and managing information. To make full use of the information in HTML tables, the tables should be transformed into XML representations. This paper presents an effective method for this conversion, which consists of two phases: area segmentation and structure analysis. Area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between the attribute and value areas is then analyzed and transformed into an XML representation using the proposed table model. Experimental results on more than 300 HTML tables show that the proposed method performs better than conventional methods, achieving an average accuracy of 86.7%.
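
To make the two phases concrete, the following is a minimal sketch in Python and not the authors' algorithm. It assumes the simplest possible segmentation, in which the first row of a table is the attribute area and every later row belongs to the value area; the paper's checks for visual and semantic coherency and its table model are not reproduced here. The names TableRows, to_tag, and table_to_xml, and the element names table and record, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_tag(label):
    """Turn a header label into an element name (illustrative rule only)."""
    name = re.sub(r"\W+", "_", label.strip()).strip("_")
    return name if name and not name[0].isdigit() else "field_" + name


def table_to_xml(html):
    """Convert a flat HTML table to XML, one <record> per value row."""
    parser = TableRows()
    parser.feed(html)
    rows = parser.rows
    # Assumed segmentation: first row = attribute area, the rest = value area.
    header = rows[0] if rows else []
    body = rows[1:]
    root = ET.Element("table")
    for values in body:
        record = ET.SubElement(root, "record")
        for label, value in zip(header, values):
            ET.SubElement(record, to_tag(label)).text = value
    return ET.tostring(root, encoding="unicode")


html = """
<table>
  <tr><th>Name</th><th>Unit Price</th></tr>
  <tr><td>Pen</td><td>1.50</td></tr>
  <tr><td>Ink</td><td>4.00</td></tr>
</table>
"""
print(table_to_xml(html))
```

A fixed first-row rule fails on tables with multi-row headers, row headers, or spanning cells, which is the kind of case the segmentation and structure analysis phases described in the abstract are meant to handle.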

Key words: HTML table, structure analysis, normalization, information extraction, XML

CLC Number: