
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 56-62. doi: 10.19678/j.issn.1000-3428.0064584

• Artificial Intelligence and Pattern Recognition •

Entity Recognition of Industry Figures Based on Character and Word Fusion and Adversarial Training

ZHU Hong1, NIU Haoran1, ZHU Tong2

  1. School of Mechanical Electronic & Information Engineering, China University of Mining and Technology-Beijing, Beijing 100083, China;
  2. Archives, China University of Mining and Technology-Beijing, Beijing 100083, China
  • Received: 2022-04-28 Revised: 2022-06-23 Published: 2022-08-31
  • About the authors: ZHU Hong (born 1974), female, associate professor; her main research interests are computer vision and natural language processing. NIU Haoran (corresponding author), master's student. ZHU Tong, associate research fellow.
  • Supported by:
    2022 Beijing Municipal Archives Bureau research project "Research on Construction Methods for Archival Knowledge Graphs: Taking Industry Figure Archives as an Example" (2022-12).

Abstract: Named entity recognition (NER) of industry figures aims to extract useful entity information from a corpus about industry figures, and it is a fundamental and critical task for the deep mining of information resources on such figures. Mainstream NER models do not make full use of word-level features, so they identify semantics and entity boundaries inaccurately when recognizing the distinctive entities found in industry-figure texts. This paper proposes an industry-figure entity recognition model based on character and word fusion and adversarial training. The RoBERTa-wwm-ext pre-trained model extracts the character features of a sentence, and a lexicon is fused in to construct the word features of the sentence. Perturbations are added to the fused character-word vector representation to generate adversarial samples. The fused representation and the adversarial samples are then fed as training data into a bidirectional long short-term memory (BiLSTM) network to learn contextual information, and a conditional random field (CRF) infers the optimal label sequence. A named entity annotation scheme is designed according to the characteristics of industry-figure texts, and a dataset is constructed for experimental validation. On the test set, the model achieves a precision of 92.94%, a recall of 94.35%, and an F1 score of 93.64%, which are 3.68, 1.24, and 2.39 percentage points higher, respectively, than those of the BERT-BiLSTM-CRF model.
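The figures reported in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the reported F1 score is the harmonic mean of the reported precision and recall (the abstract does not state the averaging scheme), and it derives the BERT-BiLSTM-CRF baseline scores only by subtracting the stated gains. Those baseline values are implied by the abstract, not quoted from it, and the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, in percent.
reported = {"precision": 92.94, "recall": 94.35, "f1": 93.64}
# Reported gains over BERT-BiLSTM-CRF, in percentage points.
gains = {"precision": 3.68, "recall": 1.24, "f1": 2.39}

# Assumption: F1 is the harmonic mean of the reported precision and recall.
recomputed_f1 = f1_score(reported["precision"], reported["recall"])

# Baseline scores implied by subtracting the gains (not quoted in the abstract).
baseline = {name: round(reported[name] - gains[name], 2) for name in reported}

print(f"recomputed F1: {recomputed_f1:.2f}")  # recomputed F1: 93.64
print("implied baseline:", baseline)
```

Under that assumption, the recomputed F1 agrees with the reported 93.64% to two decimal places.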

Key words: Named Entity Recognition (NER), industry figures, character and word fusion, adversarial training, pre-trained model
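To make the ideas of character and word fusion and adversarial perturbation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and several parts are assumptions made for illustration. Toy vectors stand in for RoBERTa-wwm-ext character features. Word features are built by averaging the vectors of the lexicon words that cover each character. The perturbation is a gradient normalized to a fixed L2 norm, in the style of the fast gradient method, which the abstract does not name. The gradient is a placeholder rather than the result of backpropagation. All function names, the tiny lexicon, and the vector values are hypothetical.

```python
import math


def match_words(sentence, lexicon):
    """For each character position, collect the lexicon words that cover it."""
    covering = [[] for _ in sentence]
    for start in range(len(sentence)):
        # Only multi-character words (length >= 2) are considered here.
        for end in range(start + 2, len(sentence) + 1):
            word = sentence[start:end]
            if word in lexicon:
                for i in range(start, end):
                    covering[i].append(word)
    return covering


def fuse(char_vecs, covering, word_vecs, word_dim):
    """Concatenate each character vector with the mean of its covering word vectors."""
    fused = []
    for c_vec, words in zip(char_vecs, covering):
        if words:
            mean = [sum(word_vecs[w][d] for w in words) / len(words)
                    for d in range(word_dim)]
        else:
            mean = [0.0] * word_dim  # no lexicon match: zero word feature
        fused.append(list(c_vec) + mean)
    return fused


def perturb(fused, grads, epsilon=1.0):
    """Add epsilon * g / ||g||_2 to the fused vectors (assumed FGM-style step)."""
    norm = math.sqrt(sum(g * g for row in grads for g in row))
    if norm == 0:
        return [row[:] for row in fused]
    return [[x + epsilon * g / norm for x, g in zip(row, g_row)]
            for row, g_row in zip(fused, grads)]


# Hypothetical toy inputs; real character vectors would come from a pre-trained model.
sentence = "中国矿业大学"
word_vecs = {"中国": [1.0, 0.0], "矿业": [0.0, 1.0],
             "大学": [1.0, 1.0], "矿业大学": [2.0, 3.0]}
char_vecs = [[0.1 * i, 0.2 * i] for i in range(len(sentence))]

covering = match_words(sentence, word_vecs)
fused = fuse(char_vecs, covering, word_vecs, word_dim=2)

# Placeholder gradient of the loss with respect to the fused vectors.
grads = [[1.0] * 4 for _ in fused]
adversarial = perturb(fused, grads, epsilon=1.0)
```

In a training loop following the abstract, both `fused` and `adversarial` would be passed to the BiLSTM-CRF tagger, so that the model learns from the clean and the perturbed representations together.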

CLC number: