Chinese Medical Entity Recognition Based on Attention Enhancement and Feature Fusion

doi:10.19678/j.issn.1000-3428.0067645

Abstract:

Character-based named entity recognition models for Chinese medical text suffer from a single embedding form, difficult boundary recognition, and insufficient use of semantic information. An effective remedy is to inject lexical features into the bottom layers of BERT, which exploits word-granularity semantic information while reducing the impact of word segmentation errors. However, injecting lexical information also introduces low-relevance words and noise, which disperses the attention of the attention-based BERT model. In addition, character and word granularity alone are not enough to fully mine the deep semantic information of Chinese characters. This study therefore proposes a Chinese medical entity recognition model based on attention enhancement and feature fusion. The character-word attention score matrix is sparsified so that the model concentrates its attention on highly relevant words, which effectively reduces interference from noisy words in the context. In parallel, Convolutional Neural Networks (CNNs) extract features from the pronunciation and strokes of Chinese characters; these features are fused by an iterative attention feature fusion module, concatenated with the output features of the BERT model, and fed to a BiLSTM model to further mine the deep semantic information contained in the characters. A large medical corpus is collected with web crawlers and other means and used to train a medical-domain word vector library. The model is validated on the CCKS2017 and CCKS2019 datasets, where it reaches F1 values of 94.90% and 89.37%, respectively, outperforming current mainstream entity recognition models.

Key words: entity recognition, Chinese word segmentation, sparse attention, feature fusion, medical word vector library
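The sparsification step described in the abstract can be illustrated with a small sketch. It assumes the common Top-k formulation (keep the k largest scores in each row, mask the rest, then apply softmax); the function name `topk_sparse_softmax`, the value of `k`, and the example scores are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def topk_sparse_softmax(scores, k):
    """Keep the k largest scores in each row, mask the rest, then apply softmax."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.shape[-1])
    # The k-th largest value in each row acts as the selection threshold.
    kth = np.sort(scores, axis=-1)[..., -k][..., None]
    # Scores below the threshold are set to -inf so softmax gives them zero weight.
    masked = np.where(scores >= kth, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical scores between one character and four candidate lexicon words.
char_word_scores = [[2.0, 0.1, 1.0, -3.0]]
print(topk_sparse_softmax(char_word_scores, k=2))
```

With `k=2`, only the two highest-scoring words receive non-zero attention weight, so low-relevance words no longer dilute the distribution. Ties at the threshold are all kept in this sketch, so a row can retain more than k entries.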

Jintao WANG, Ang QIN, Yuan ZHANG, Yifei CHEN, Tingfeng WANG, Chenglin XIE, Gang ZOU. Chinese Medical Entity Recognition Based on Attention Enhancement and Feature Fusion[J]. Computer Engineering, 2024, 50(7): 324-332.

https://www.ecice06.com/CN/Y2024/V50/I7/324

Figures/Tables: 13

Fig.1 Procedure of model construction

Fig.2 Top-k attention score filtering

Fig.3 Multi-scale channel attention module

Fig.4 LSTM unit structure

Fig.5 Radical distribution of CCKS2017

Fig.6 Radical distribution of CCKS2019
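As a rough illustration of how a multi-scale channel attention module (Fig.3) can drive the iterative attention feature fusion mentioned in the abstract, the sketch below fuses two feature maps in two attention passes. It is a simplified assumption-laden version: batch normalisation is omitted, weights are random rather than trained, and the names `make_ms_cam`, `iterative_fusion`, `pinyin_feat`, and `stroke_feat` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ms_cam(channels, reduction=4):
    """Build a simplified multi-scale channel attention function (no batch norm)."""
    hidden = max(1, channels // reduction)
    # Separate bottleneck weights for the local and global branches (random here).
    params = {
        name: (rng.normal(0, 0.1, (channels, hidden)),
               rng.normal(0, 0.1, (hidden, channels)))
        for name in ("local", "global")
    }

    def bottleneck(x, name):
        w1, w2 = params[name]
        return np.maximum(x @ w1, 0.0) @ w2  # linear -> ReLU -> linear

    def ms_cam(x):
        # x has shape (seq_len, channels).
        local = bottleneck(x, "local")                              # per-position context
        glob = bottleneck(x.mean(axis=0, keepdims=True), "global")  # sequence-level context
        return 1.0 / (1.0 + np.exp(-(local + glob)))                # weights in (0, 1)

    return ms_cam

def iterative_fusion(x, y, attn_first, attn_second):
    """Fuse x and y twice: a first pass builds an initial mix, a second refines the weights."""
    w = attn_first(x + y)
    initial = w * x + (1.0 - w) * y
    w2 = attn_second(initial)
    return w2 * x + (1.0 - w2) * y

# Hypothetical CNN outputs for pronunciation and stroke features: 5 characters, 8 channels.
pinyin_feat = rng.normal(size=(5, 8))
stroke_feat = rng.normal(size=(5, 8))
fused = iterative_fusion(pinyin_feat, stroke_feat, make_ms_cam(8), make_ms_cam(8))
print(fused.shape)
```

Because the attention weights lie strictly between 0 and 1, each fused value is a convex combination of the two inputs. In the pipeline the abstract describes, a fused feature of this kind would then be concatenated with the BERT output before the BiLSTM.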

