
Computer Engineering ›› 2022, Vol. 48 ›› Issue (3): 107-114,145. doi: 10.19678/j.issn.1000-3428.0060466

• Artificial Intelligence and Pattern Recognition •

Vietnamese Event Entity Recognition Combining Dictionary and Adversarial Transfer

XUE Zhenyu1,2, XIAN Yantuan1,2, YU Zhengtao1,2, GAO Shengxiang1,2, PU Liuqing1,2   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  • Received: 2021-01-04 Revised: 2021-03-01 Published: 2021-03-09

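  • About the authors: XUE Zhenyu (born 1996), male, master's student, whose main research interests are natural language processing and cross-language information retrieval; XIAN Yantuan, associate professor, M.S.; YU Zhengtao, professor, Ph.D.; GAO Shengxiang, associate professor, Ph.D.; PU Liuqing, master's student.
  • Supported by:
    National Natural Science Foundation of China (61972186, 61762056, 61472168); Yunnan Provincial Major Science and Technology Special Plan (202002AD080001); Yunnan Provincial High-Tech Industry Special Project (201606).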

Abstract: Annotated corpora for Vietnamese events are scarce, and the annotated data contain many out-of-vocabulary words; both factors reduce entity recognition accuracy. To address this problem, this study proposes an entity recognition model that combines a dictionary with adversarial transfer. Vietnamese is the target language, and English and Chinese are the source languages. The entity annotation information of the source language and a bilingual dictionary are used to improve entity recognition in the target language. Word-level adversarial transfer is used to share a semantic space between the source and target languages. A bilingual dictionary is then incorporated for multi-granular feature embedding to enrich the semantic representation of target-language words, and sentence-level adversarial transfer is used to extract language-independent sequence features. Finally, a Conditional Random Field (CRF) inference module labels the entity recognition results. Experimental results on a Vietnamese news dataset show that, with English and Chinese as the source languages, the proposed model clearly outperforms mainstream monolingual entity recognition models and transfer learning models. After target semantic annotation data are added, its F1-score exceeds that of the monolingual entity recognition model by 19.61 and 18.73 percentage points, respectively.

Key words: entity recognition, adversarial transfer, bilingual dictionary, multi-granular feature, sequence feature
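
The abstract states that a CRF inference module labels the final entity sequence but gives no implementation details. As a rough illustration only, the following sketch shows generic Viterbi decoding for a linear-chain CRF. The tag set, the emission and transition scores, and the function name `viterbi_decode` are hypothetical and are not taken from the paper; in the described model the emission scores would presumably come from the sequence features.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[(prev, cur)] is the score of moving from prev to cur
    (missing pairs are treated as 0.0).
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for the three tokens of "ở Hà Nội" ("in Hanoi").
TAGS = ["O", "B-LOC", "I-LOC"]
TRANSITIONS = {
    ("O", "I-LOC"): -10.0,     # discourage I-LOC directly after O
    ("B-LOC", "I-LOC"): 1.0,   # encourage continuing an entity
    ("I-LOC", "I-LOC"): 0.5,
}
EMISSIONS = [
    {"O": 2.0, "B-LOC": 0.1, "I-LOC": 0.0},
    {"O": 0.5, "B-LOC": 1.0, "I-LOC": 1.2},
    {"O": 0.3, "B-LOC": 0.2, "I-LOC": 1.5},
]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
```

With these made-up scores, choosing the best tag per token independently would label the second token `I-LOC` directly after `O`, which is not a valid BIO sequence; the transition scores let the decoder prefer `O, B-LOC, I-LOC` instead.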

