Domain Entity Extraction Method Based on Graph Sorting and Maximal Information Gain

doi:10.19678/j.issn.1000-3428.0063070

Abstract

Abstract: Domain knowledge graphs play an important role across industries, and acquiring domain entities is the foundation for constructing them. Existing entity extraction methods, such as those that rely on annotated data or hand-written extraction rules, often require substantial manual effort. This paper proposes an entity extraction method based on graph sorting and an entity expansion method based on maximal information gain to construct a domain entity set. Candidate entities are obtained through entity recognition, an entity graph is built by computing the relatedness between candidate entities from Wikipedia background information, and a graph sorting algorithm based on confidence propagation is used to select the core domain entities. In DBpedia, maximal information gain is then used to balance two factors, the relevance of a class to the core domain entities and the degree of abstraction of the class, in order to generate the generic classes used for entity expansion. On this basis, the instance entities of the generic classes are obtained through the "Is subject of" relationship of the SKOS system, and the expanded instance entities are further filtered using string similarity and structural relevance, yielding a comprehensive and accurate domain entity set. Taking a data structure course as an example, an entity set for the course domain is constructed, containing 1 115 entities. Experimental results show that, on the domain data set, the F1 score of domain entity extraction reaches 0.67; the method can effectively obtain domain entities with little human involvement and is useful for the construction of domain knowledge graphs.

Key words: entity extraction, entity expansion, graph sorting algorithm, maximal information gain, knowledge graph
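
As a rough illustration of the graph sorting step described in the abstract, the following Python sketch propagates confidence over a small weighted entity graph. It is a generic damped propagation (similar in spirit to personalized PageRank) and should not be read as the authors' algorithm. The function name `rank_entities`, the damping value, the entity names, and the relatedness weights are all assumptions made for illustration; in the paper, relatedness is computed from Wikipedia background information.

```python
from collections import defaultdict


def rank_entities(edges, prior, damping=0.85, iterations=50):
    """Rank nodes of a weighted undirected graph by propagated confidence.

    edges: iterable of (entity_a, entity_b, relatedness), relatedness > 0
    prior: dict mapping entity -> initial confidence (hypothetical values)
    """
    # Build a symmetric adjacency map from the relatedness triples.
    neighbors = defaultdict(dict)
    for a, b, weight in edges:
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    nodes = set(neighbors) | set(prior)
    total_prior = sum(prior.get(n, 0.0) for n in nodes)
    if total_prior <= 0:
        raise ValueError("at least one entity needs a positive prior")

    # Normalize the prior so that scores stay on a comparable scale.
    base = {n: prior.get(n, 0.0) / total_prior for n in nodes}
    out_weight = {n: sum(neighbors[n].values()) for n in nodes}

    scores = dict(base)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on a share of its score,
            # proportional to the weight of the connecting edge.
            incoming = sum(
                scores[m] * w / out_weight[m] for m, w in neighbors[n].items()
            )
            updated[n] = (1 - damping) * base[n] + damping * incoming
        scores = updated
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up candidate entities and relatedness weights for a data structure course.
edges = [
    ("Binary tree", "Linked list", 0.6),
    ("Binary tree", "Heap", 0.7),
    ("Linked list", "Heap", 0.4),
    ("Heap", "Coffee", 0.05),
]
prior = {"Binary tree": 1.0, "Linked list": 1.0, "Heap": 1.0, "Coffee": 1.0}
ranking = rank_entities(edges, prior)
for entity, score in ranking:
    print(f"{entity}: {score:.3f}")
```

In this toy graph the weakly connected candidate "Coffee" receives little propagated confidence and ends up last, which is the kind of signal a threshold on the ranking could use to keep only core domain entities.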

CLC number: TP18

ZHANG Xiaoming, ZHENG Lixin, WANG Huiyong. Domain Entity Extraction Method Based on Graph Sorting and Maximal Information Gain[J]. Computer Engineering, 2022, 48(12): 140-149.

http://www.ecice06.com/CN/Y2022/V48/I12/140

Figures/Tables: 13


