
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 71-77. doi: 10.19678/j.issn.1000-3428.0066379

• Artificial Intelligence and Pattern Recognition •

Chinese Nested Named Entity Recognition Based on Location Embedding and Multilevel Prediction

Jianyong DUAN1,2, Yifei ZHU1, Hao WANG1,2, Li HE1,2, Xin LI1,2

  1. School of Information, North China University of Technology, Beijing 100144, China
  2. CNONIX National Standard Application and Promotion Laboratory, Beijing 100144, China
  • Received:2022-11-28 Online:2023-12-15 Published:2023-12-14
  • About the authors:

    Jianyong DUAN (born 1978), male, professor, Ph.D., CCF member; main research interests: natural language processing and information retrieval

    Yifei ZHU, master's student

    Hao WANG, associate professor, Ph.D.

    Li HE, associate professor, Ph.D.

    Xin LI, lecturer, Ph.D.

  • Funding:
    National Natural Science Foundation of China (61972003); Humanities and Social Sciences Fund of the Ministry of Education (21YJA740052); Scientific Research Program of the Beijing Municipal Education Commission (KM202210009002)

Abstract:

Traditional Chinese nested Named Entity Recognition (NER) models often struggle to locate entity boundaries accurately and to resolve the ambiguous boundaries between Chinese characters and words. To address these problems, a nested NER model based on position embedding and multilevel result boundary prediction is proposed. In the embedding layer, the position information of nested entities is encoded together with the text position information to generate an absolute position sequence; by attending to the position information inherent in Chinese text, the model further explores the relationship between nested entities and characters and strengthens the connection between nested entities and the original text. In the encoding layer, nested entities are initially identified using a hidden matrix that excludes the best path. In the decoding layer, the offsets of entity boundaries are calculated and the boundaries are redetermined, which improves the accuracy of Chinese nested entity recognition. Experimental results show that, compared with the best values among the baseline models, the proposed model improves precision, recall, and F1 by 0.34, 1.06, and 0.80 percentage points, respectively, on the medical-domain dataset and by 11.90, 0.78, and 6.23 percentage points, respectively, on the daily-domain dataset, indicating good recognition performance.

Key words: nested Named Entity Recognition(NER), location embedding, Boundary Prediction Unit(BPU), Conditional Random Field(CRF), multilevel prediction
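
As a rough illustration of the boundary-refinement idea summarized in the abstract, the Python sketch below represents nested entities as character spans, shifts the boundaries of coarse predictions by per-span offsets, and scores the result with span-level precision, recall, and F1. Everything here is an assumption made for illustration and is not taken from the paper: the `Span` type, the `apply_offsets` and `precision_recall_f1` helpers, the example text, and the offset values. In the paper the offsets are predicted by the model, whereas here they are hard-coded.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical entity span over characters, with inclusive boundaries."""
    start: int
    end: int
    label: str


def apply_offsets(spans, offsets, text_len):
    """Shift each span by (start_offset, end_offset), clamped to the text.

    Illustrative only: in a real model the offsets would be predicted,
    not supplied by hand.
    """
    refined = set()
    for span, (d_start, d_end) in zip(spans, offsets):
        start = min(max(span.start + d_start, 0), text_len - 1)
        end = min(max(span.end + d_end, start), text_len - 1)
        refined.add(Span(start, end, span.label))
    return refined


def precision_recall_f1(predicted, gold):
    """Span-level exact-match scores: a span counts only if start, end, and label all match."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: in the 4-character text "北京大学", the organization
# "北京大学" (characters 0-3) contains the nested location "北京" (characters 0-1).
text = "北京大学"
gold = {Span(0, 3, "ORG"), Span(0, 1, "LOC")}

# Hypothetical coarse predictions: the outer entity ends one character too early.
coarse = [Span(0, 2, "ORG"), Span(0, 1, "LOC")]
print(precision_recall_f1(set(coarse), gold))   # (0.5, 0.5, 0.5)

# Hypothetical offsets: extend the right boundary of the outer span by one.
refined = apply_offsets(coarse, [(0, 1), (0, 0)], text_len=len(text))
print(precision_recall_f1(refined, gold))       # (1.0, 1.0, 1.0)
```

Because nested entities overlap, the sketch keeps spans in a set of `(start, end, label)` tuples rather than assigning one tag per character, so the inner and outer entities can both be scored.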