Computer Engineering ›› 2024, Vol. 50 ›› Issue (7): 324-332. doi: 10.19678/j.issn.1000-3428.0067645

• Development Research and Engineering Application •

  • Funding:
    Natural Science Foundation of Hunan Province (2022JJ70022)

Chinese Medical Entity Recognition Based on Attention Enhancement and Feature Fusion

Jintao WANG1, Ang QIN2, Yuan ZHANG1, Yifei CHEN2, Tingfeng WANG1, Chenglin XIE3, Gang ZOU1,4,*

  1. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
  2. Hunan Provincial Tumor Hospital, Changsha 410031, Hunan, China
  3. The Affiliated Hospital of Hunan Academy of Traditional Chinese Medicine, Changsha 410006, Hunan, China
  4. Hunan ZK Help Innovation Intelligent Technology Research Institute, Changsha 410076, Hunan, China
  • Received: 2023-05-18 Published: 2024-07-15 Online: 2024-05-15
  • Contact: Gang ZOU

Abstract:

Chinese medical named entity recognition models based on character representations suffer from a single form of embedding, difficulty in recognizing entity boundaries, and insufficient use of semantic information. An effective remedy is to inject lexical features into the bottom layers of BERT, which exploits word-granularity semantic information while reducing the impact of word segmentation errors. However, injecting lexical information also introduces weakly related words and noise, which disperses the attention of the attention-based BERT model. In addition, character and word granularity alone are not enough to fully mine the deep semantic information of Chinese characters. This study therefore proposes a Chinese medical entity recognition model based on attention enhancement and feature fusion. The character-word attention score matrix is sparsified so that the model concentrates its attention on highly relevant words, which effectively reduces interference from noisy words in the context. In parallel, Convolutional Neural Networks (CNNs) extract features from the pronunciation and strokes of Chinese characters; these features are fused by an iterative attention feature fusion module, concatenated with the output features of the BERT model, and fed into a BiLSTM model to further mine the deep semantic information contained in the characters. A large medical corpus is collected using web crawlers and other means and used to train a medical-domain word vector library. The model is validated on the CCKS2017 and CCKS2019 datasets, where it achieves F1 values of 94.90% and 89.37%, respectively, outperforming current mainstream entity recognition models.

Key words: entity recognition, Chinese word segmentation, sparse attention, feature fusion, medical word vector library
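
The sparsification of the character-word attention score matrix mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which sparsification rule the model uses, so the code below assumes a simple top-k rule: for each character, only the k highest-scoring candidate words are kept before the softmax, and all other words receive zero weight. The function name `sparse_attention_weights`, the parameter `k`, and the row/column layout of the score matrix are hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def sparse_attention_weights(scores, k):
    """Top-k sparse softmax over each row of an attention score matrix.

    Assumed layout (hypothetical): scores[i][j] is the raw attention score
    between character i and candidate word j. Each row must be non-empty.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    weights = []
    for row in scores:
        # Indices of the k highest-scoring candidate words for this character.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        # Subtract the row maximum before exponentiating for numerical stability.
        top = max(row)
        exps = [math.exp(s - top) if j in keep else 0.0
                for j, s in enumerate(row)]
        total = sum(exps)
        # Masked words get exactly zero weight; kept words sum to one.
        weights.append([e / total for e in exps])
    return weights


# Two characters, four candidate words each; every character keeps two words.
example = sparse_attention_weights(
    [[2.0, 0.1, 1.5, -0.3],
     [0.2, 0.2, 3.0, 0.1]],
    k=2,
)
for row in example:
    print([round(x, 3) for x in row])
```

Under this assumption, weakly related words are removed from the weighted sum entirely instead of contributing a small amount of noise, which is the effect the abstract attributes to the sparse processing step. A threshold-based rule would be an equally plausible reading of the abstract.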