
Computer Engineering ›› 2017, Vol. 43 ›› Issue (12): 165-172. doi: 10.3969/j.issn.1000-3428.2017.12.031

• Artificial Intelligence and Recognition Technology •

Entity Column Discovery Algorithm of Web Table

ZHANG Lifang, WANG Ning, QI Fei

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2016-11-17 Online: 2017-12-15 Published: 2017-12-15
  • About the authors: ZHANG Lifang (born 1991), female, master's student, research interests: Web data integration and data mining; WANG Ning (corresponding author), professor, Ph.D.; QI Fei, master's student.
  • Supported by:
    National Natural Science Foundation of China (61370060).

Abstract: Machines cannot directly understand the semantic information in Web tables. Traditional entity column discovery methods usually rely on header information and a knowledge base, so they do not apply to Web tables without headers. This paper proposes an entity column discovery algorithm for Web tables based on approximate dependencies between column values and on normalization. It annotates entity columns in tables that have no header, in tables whose complete header cannot be recovered, and even in tables with multiple entity columns. The algorithm detects the approximate functional dependencies inherent among the table's attributes from their values, prunes noisy functional dependencies according to the characteristics of Web tables, and normalizes the table with the resulting dependency set to obtain its entity columns. Compared with an entity column detection algorithm based on a knowledge base, the proposed algorithm does not depend on header information, improves both recall and precision by 3%~5%, and applies to a wider range of tables.

Key words: Web table, entity column, approximate functional dependency, semantic recovery, normalization
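
To make the idea of an approximate functional dependency concrete, the sketch below may help. It is not the authors' algorithm: the abstract does not say how dependencies are measured, pruned, or normalized. The sketch assumes the commonly used g3 error (the smallest fraction of rows that must be removed for X -> Y to hold exactly), a hypothetical threshold `eps`, and a simple stand-in for normalization that picks the column determining the most other columns. All function names and the sample table are illustrative.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Smallest fraction of rows to drop so that column lhs -> column rhs holds."""
    if not rows:
        return 0.0
    # For each lhs value, count how often each rhs value occurs with it.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Keep, per lhs value, only the rows carrying its most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def approximate_fds(rows, n_cols, eps=0.1):
    """All single-column pairs (a, b) with g3 error at most eps (assumed threshold)."""
    return [
        (a, b)
        for a in range(n_cols)
        for b in range(n_cols)
        if a != b and g3_error(rows, a, b) <= eps
    ]


def candidate_entity_columns(rows, n_cols, eps=0.1):
    """Columns that approximately determine the largest number of other columns."""
    determined = Counter(a for a, _ in approximate_fds(rows, n_cols, eps))
    if not determined:
        return []
    best = max(determined.values())
    return [c for c in range(n_cols) if determined[c] == best]


# Headerless sample table: city, country, continent.
table = [
    ("Beijing", "China", "Asia"),
    ("Shanghai", "China", "Asia"),
    ("Paris", "France", "Europe"),
    ("Lyon", "France", "Europe"),
    ("Tokyo", "Japan", "Asia"),
]
print(candidate_entity_columns(table, 3))
```

On this table the city column (index 0) determines both other columns and is returned as the only candidate. A column of unique values, such as a row number, trivially satisfies every dependency; that is one kind of noise a real pruning step would have to handle, and this sketch does not attempt it.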
