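Abstract (translated from Chinese): XML has become the standard format for processing and managing information, while HTML tables are widely used on the Web. To make full use of and manage the information in HTML tables, the tables need to be converted into XML. An effective method for this is proposed, consisting of two parts: table recognition and structure conversion. Table recognition extracts tables by examining format, syntactic, and semantic features and segments them into value areas and attribute areas. Preset table templates are then used to analyze the hierarchical structure between the attribute and value areas and to convert it into XML format. Experiments on more than 300 tables show that the proposed method outperforms traditional methods, with an accuracy of 86.7%.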
Keywords (translated from Chinese):
HTML table,
structure analysis,
normalization,
information extraction,
Extensible Markup Language
Abstract: HTML tables are widely used on the Web, while XML is widely accepted as a standard format for processing and managing information. To make full use of and manage the information held in HTML tables, the tables should be transformed into XML representations. This paper presents an efficient method for this transformation, which consists of two phases: area segmentation and structure analysis. Area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. Structure analysis then determines the hierarchical structure between the attribute and value areas and transforms it into an XML representation using a proposed table model. Experimental results on more than 300 HTML tables show that the proposed method performs better than conventional methods, with an average accuracy of 86.7%.
Key words:
HTML table,
structure analysis,
normalization,
information extraction,
XML
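The two phases named in the abstract can be illustrated with a deliberately simplified Python sketch. It is not the paper's algorithm: the abstract does not specify the coherency checks or the table model, so the sketch assumes a flat, non-nested table whose attribute area is a leading row made up only of `<th>` cells. The names `TableRows`, `segment`, `xml_name`, and `table_to_xml`, and the `<table>`/`<record>` output elements, are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser
import re
import xml.etree.ElementTree as ET


class TableRows(HTMLParser):
    """Collect the cells of a simple, non-nested HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # each row is a list of (cell_tag, cell_text)
        self._cell = None   # tag of the cell currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell, self._text = tag, []

    def handle_data(self, data):
        if self._cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._cell:
            self.rows[-1].append((tag, "".join(self._text).strip()))
            self._cell = None


def segment(rows):
    """Assumed heuristic: a leading all-<th> row is the attribute area."""
    if rows and all(tag == "th" for tag, _ in rows[0]):
        return [text for _, text in rows[0]], rows[1:]
    return [], rows  # no attribute area recognized


def xml_name(label, index):
    """Turn an attribute label into a usable XML element name."""
    name = re.sub(r"\W+", "_", label).strip("_")
    if not name or not name[0].isalpha():
        name = f"col{index}"  # fallback for empty or unusable labels
    return name


def table_to_xml(html):
    """Segment the table, then emit one <record> per value row."""
    parser = TableRows()
    parser.feed(html)
    rows = [row for row in parser.rows if row]
    attributes, value_rows = segment(rows)
    root = ET.Element("table")
    for row in value_rows:
        record = ET.SubElement(root, "record")
        for i, (_, text) in enumerate(row):
            label = attributes[i] if i < len(attributes) else ""
            ET.SubElement(record, xml_name(label, i + 1)).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<table><tr><th>Name</th><th>Unit price</th></tr>"
              "<tr><td>Pen</td><td>1.50</td></tr></table>")
    print(table_to_xml(sample))
```

Tables with spanning cells, attribute columns, or multi-level headers are exactly the cases where a hierarchical table model is needed; this sketch maps only a single header row onto element names.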
CLC number:
贾长云;程永上. HTML表格向XML的智能转换[J]. 计算机工程, 2009, 35(14): 32-34.
JIA Chang-yun; CHENG Yong-shang. Intelligence Conversion of HTML Table into XML[J]. Computer Engineering, 2009, 35(14): 32-34.