
Computer Engineering ›› 2024, Vol. 50 ›› Issue (11): 119-129. doi: 10.19678/j.issn.1000-3428.0068258

• Artificial Intelligence and Pattern Recognition •

Chinese Named Entity Recognition Based on Wobert and Adversarial Learning

NI Yuan1,2, LIAO Shihao3,*, ZHANG Jian1,2

  1. College of Economics and Management, Beijing Information Science and Technology University, Beijing 100192, China
    2. Beijing Key Laboratory of Green Development Big Data Decision, Beijing 100192, China
    3. College of Computer Science, Beijing Information Science and Technology University, Beijing 100192, China
  • Received: 2023-08-17  Online: 2024-11-15  Published: 2024-03-05
  • Corresponding author: LIAO Shihao
  • Supported by: Young Scientist Project of the National Key Research and Development Program of China (2021YFF0900200)



Abstract:

Natural Language Processing (NLP) typically models Chinese Named Entity Recognition (NER) as a sequence labeling task that maps each character in the text to a label. Because each character is relatively independent and carries limited information, adding lexical information to NER can compensate for the lack of connections between characters. Existing Chinese NER models, however, often require an additionally constructed vocabulary, extract lexical information in a cumbersome way, and struggle to fuse word-level and character-level embeddings because the two come from different sources. To address these problems, this study proposes ALWAE-BiLSTM-CRF, a Chinese NER model based on Wobert and adversarial learning. Unlike traditional pre-trained models, the Wobert pre-trained model segments the text into words during pre-training and therefore fully learns information at both the word and character levels. Accordingly, ALWAE-BiLSTM-CRF obtains character-level vectors from the Wobert pre-trained model and uses the Wobert tokenizer to retrieve the word vectors already present in the pre-trained model. A BiLSTM then captures the temporal information of both sequences, and a multi-head attention mechanism fuses the word-level information into the character vectors. Meanwhile, adversarial attacks generate adversarial examples to strengthen the model's generalization. Finally, a Conditional Random Field (CRF) layer constrains the output to obtain the best prediction sequence. Comparative and ablation experiments on the Resume dataset and on Porcelain, a self-built dataset in the porcelain domain, show that ALWAE-BiLSTM-CRF achieves F1 scores of 97.21% and 85.7%, respectively, demonstrating its effectiveness for the Chinese NER task.
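Two of the steps described above, the adversarial perturbation of the embeddings and the CRF decoding of the final tag sequence, can be sketched in plain Python. This is a minimal illustrative sketch, not the authors' implementation: the function names, the toy emission/transition scores, and the ε value are assumptions, and in the actual model the gradient would come from backpropagation while the scores would be learned parameters.

```python
import math

def fgm_perturb(grad, epsilon=1.0):
    """FGM-style adversarial perturbation: r = epsilon * g / ||g||2.

    Added to an embedding, r moves the input in the direction that most
    increases the loss, producing an adversarial example for training.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

def viterbi_decode(emissions, transitions):
    """Best tag sequence under a linear-chain CRF (Viterbi algorithm).

    emissions[t][j]   : score of tag j at position t (e.g., from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score of a path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j
            prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            bp.append(prev)
            new_score.append(score[prev] + transitions[prev][j] + emit[j])
        backpointers.append(bp)
        score = new_score
    best_tag = max(range(num_tags), key=lambda t: score[t])
    path = [best_tag]
    for bp in reversed(backpointers):  # trace the best path backwards
        path.append(bp[path[-1]])
    path.reverse()
    return path, score[best_tag]

# Toy demo: 3 positions, 2 tags; transitions reward staying on the same tag,
# so the CRF overrides the middle position's emission preference for tag 1.
path, best = viterbi_decode([[2, 1], [1, 2], [2, 1]], [[1, 0], [0, 1]])
print(path, best)               # -> [0, 0, 0] 7
print(fgm_perturb([3.0, 4.0]))  # -> [0.6, 0.8]
```

The demo shows why the CRF layer "constrains the results": even though tag 1 has the higher emission score at the middle position, the transition scores make the consistent sequence [0, 0, 0] the global optimum.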

Key words: deep learning, Named Entity Recognition (NER), attention mechanism, feature fusion, Conditional Random Field (CRF)