
计算机工程 (Computer Engineering) ›› 2023, Vol. 49 ›› Issue (3): 121-127. doi: 10.19678/j.issn.1000-3428.0064077

• Artificial Intelligence and Pattern Recognition •

  • About the authors: WU Xueying (born 1998), female, master's student; her research focuses on natural language processing. DUAN Youxiang, professor, Ph.D. CHANG Lunjie, professorate senior engineer. LI Shiyin, senior engineer. SUN Qifeng, lecturer, Ph.D.
  • Funding: Major Science and Technology Project of CNPC (ZD2019-183-006); the Fundamental Research Funds for the Central Universities (20CX05017A).

Research on Entity and Relation Joint Extraction for Geological Domain

WU Xueying1, DUAN Youxiang1, CHANG Lunjie2, LI Shiyin2, SUN Qifeng1   

  1. College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, Shandong, China;
    2. Research Institute of Exploration & Development, PetroChina Tarim Oilfield Company, Korla 841000, Xinjiang, China
  • Received: 2022-03-02 Revised: 2022-05-05 Published: 2022-05-25


Abstract: Constructing a knowledge graph in the geological domain enables convenient and efficient sharing and application of multi-source geological knowledge, and the extraction of geological relation triples is of great significance for building such a graph. To address the inability of existing joint entity and relation extraction models to effectively identify overlapping triples, and considering the specialized nature of geological domain knowledge, this paper proposes HtERT, a hierarchical tagging model for relation triple extraction in the geological domain, built on the pre-trained language model BERT. The Chinese pre-trained language model BERT-wwm replaces the original BERT as the underlying encoder, improving the model's ability to encode Chinese text. During entity recognition, an embedding of the entity's start position is introduced to limit the length of extracted entities, thereby improving the accuracy of entity recognition. Global context information and a BiLSTM network are introduced so that the extracted features represent geological samples more precisely, strengthening the model's ability to extract geological relation triples, including overlapping triples. Experimental results on a geological domain dataset show that HtERT clearly outperforms baseline models such as PCNN, BiLSTM, PCNN+ATT, and CASREL, improving Precision, Recall, and F1 by an average of 15.24, 10.96, and 13.20 percentage points, respectively, which verifies the model's effectiveness for the joint extraction of entities and relations in the geological domain.
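HtERT's implementation is not reproduced in this abstract. As a rough illustration only, the sketch below shows the decoding side of a cascaded (hierarchical) tagging scheme of the kind the CASREL baseline popularized: subject spans are first decoded from start/end pointer probabilities, then relation-specific object taggers are applied per subject, which is how one subject can participate in several (overlapping) triples. The `max_len` cap loosely mirrors the idea of using the start position to bound entity length; all names, thresholds, and toy probabilities here are hypothetical.

```python
def extract_spans(start_probs, end_probs, threshold=0.5, max_len=8):
    """Decode entity spans from start/end pointer probabilities.

    Each position whose start probability exceeds the threshold is
    paired with the nearest end position at or after it, looking at
    most max_len tokens ahead, so extracted entities stay short.
    """
    spans = []
    for i, sp in enumerate(start_probs):
        if sp < threshold:
            continue
        for j in range(i, min(i + max_len, len(end_probs))):
            if end_probs[j] >= threshold:
                spans.append((i, j))  # inclusive token span
                break
    return spans


def decode_triples(subj_starts, subj_ends, object_taggers,
                   threshold=0.5, max_len=8):
    """Cascade decode: subjects first, then relation-specific objects.

    object_taggers maps (subject_span, relation) to the object tagger's
    (start_probs, end_probs). Because objects are tagged per relation
    and per subject, a single subject can yield several triples, which
    is how overlapping triples are recovered.
    """
    triples = []
    for subj in extract_spans(subj_starts, subj_ends, threshold, max_len):
        for (span, rel), (o_st, o_en) in object_taggers.items():
            if span != subj:
                continue
            for obj in extract_spans(o_st, o_en, threshold, max_len):
                triples.append((subj, rel, obj))
    return triples


# Toy example: one subject at token 0, one "located_in" object at token 2.
taggers = {((0, 0), "located_in"): ([0.1, 0.1, 0.9, 0.1],
                                    [0.1, 0.1, 0.9, 0.1])}
print(decode_triples([0.9, 0.1, 0.1, 0.1], [0.9, 0.1, 0.1, 0.1], taggers))
# → [((0, 0), 'located_in', (2, 2))]
```

In a real model the probability sequences would come from classification heads over BERT-wwm/BiLSTM token features; only the span-decoding logic is shown here.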

Key words: entity and relation extraction, joint extraction, overlapping triples, geological domain, pre-trained model BERT

CLC Number: