Computer Engineering ›› 2022, Vol. 48 ›› Issue (12): 140-149. doi: 10.19678/j.issn.1000-3428.0063070

• Artificial Intelligence and Pattern Recognition •

Domain Entity Extraction Method Based on Graph Sorting and Maximal Information Gain

ZHANG Xiaoming, ZHENG Lixin, WANG Huiyong

  1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
  • Received: 2021-10-27 Revised: 2022-01-12 Published: 2022-01-20
  • About the authors: ZHANG Xiaoming (born 1975), male, professor, Ph.D.; main research interests: semantic computing and knowledge graphs. ZHENG Lixin, master's student. WANG Huiyong (corresponding author), associate professor, Ph.D.
  • Supported by:
    Natural Science Foundation of Hebei Province (F2018208116); Key Science and Technology Research Project of Higher Education Institutions of Hebei Province (ZD2021048).

Abstract: Domain knowledge graphs play an important role across many industries, and acquiring domain entities is a key foundation for building them. Existing entity extraction methods, such as those that rely on data annotation or hand-written extraction rules, usually require substantial manual effort. To address this problem, this paper proposes a graph-sorting-based entity extraction method and a maximal-information-gain-based entity expansion method for constructing a domain entity set. Candidate entities are first obtained through entity recognition. An entity graph is then built by computing the relatedness between candidate entities from Wikipedia background information, and a graph sorting algorithm based on confidence propagation is used to select the core domain entities. In DBpedia, maximal information gain is used to balance two factors, the relevance of a class to the core domain entities and the degree of abstraction of the class, in order to generate the generic classes used for entity expansion. On this basis, the instance entities of the generic classes are obtained through the "Is subject of" relationship in the SKOS system, and the expanded instance entities are further filtered using string similarity and structural relevance, yielding a comprehensive and accurate domain entity set. Taking a data structure course as an example, the entity set of the course domain is constructed, and 1 115 entities are obtained. Experimental results show that, on the domain data set, the F1 value of domain entity extraction reaches 0.67. The method can effectively obtain domain entities with little human participation and is useful for the construction of domain knowledge graphs.

Key words: entity extraction, entity expansion, graph sorting algorithm, maximal information gain, knowledge graph
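
The core-entity selection step described in the abstract (score candidates by propagating confidence over a relatedness graph, then keep the top-ranked ones) can be illustrated with a small sketch. The abstract does not state the paper's update rule, so the code below substitutes a generic PageRank-style power iteration; the function name `rank_entities`, the `damping` and `iterations` parameters, and the toy relatedness values are all assumptions made for illustration, not the authors' algorithm or data.

```python
from typing import Dict, Tuple


def rank_entities(
    relatedness: Dict[Tuple[str, str], float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Score entities by spreading confidence along weighted, undirected edges.

    This is a PageRank-style stand-in (an assumption), not the paper's rule.
    """
    # Build a symmetric weighted adjacency map; ignore non-positive weights.
    neighbors: Dict[str, Dict[str, float]] = {}
    for (a, b), weight in relatedness.items():
        if weight <= 0 or a == b:
            continue
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    n = len(neighbors)
    # Start from a uniform confidence over all candidates.
    scores = {entity: 1.0 / n for entity in neighbors}
    # Total edge weight per node, used to normalise outgoing confidence.
    totals = {entity: sum(edges.values()) for entity, edges in neighbors.items()}

    for _ in range(iterations):
        updated = {}
        for entity, edges in neighbors.items():
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(scores[nb] * w / totals[nb] for nb, w in edges.items())
            updated[entity] = (1.0 - damping) / n + damping * incoming
        scores = updated
    return scores


# Hypothetical relatedness scores between candidate entities.
toy_graph = {
    ("binary tree", "heap"): 0.8,
    ("binary tree", "linked list"): 0.6,
    ("heap", "linked list"): 0.5,
    ("linked list", "coffee"): 0.1,  # weakly related, off-domain candidate
}
scores = rank_entities(toy_graph)
# Keep the three highest-scoring candidates as "core" entities.
core_entities = sorted(scores, key=scores.get, reverse=True)[:3]
print(core_entities)
```

In this sketch a candidate that is only weakly connected to the rest of the graph receives little propagated confidence and falls to the bottom of the ranking, which is the intuition behind using graph ranking to separate core domain entities from noise. A cutoff of three is arbitrary here; a score threshold would serve equally well.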

CLC Number: