
Computer Engineering (计算机工程)



Chinese Medical Named Entity Recognition Integrating Multi-Level Granular Features

  • Published: 2025-05-14

Abstract: Chinese medical named entity recognition aims to identify entities with specific meanings in medical texts, such as diseases, drugs, symptoms, and body parts. The task supports clinical decision-making, medical information integration, and medical record management. However, existing research on Chinese medical named entity recognition has not fully accounted for the complex structure of medical texts: specialized terminology is abundant, embeddings capture only a single level of information, and semantic information is under-exploited. To address these issues, this paper proposes a Chinese medical named entity recognition model that integrates multi-level granular features. The model first uses the pre-trained BERT model to generate character embeddings for the text, and applies one-dimensional and two-dimensional convolutional neural networks to extract the stroke and glyph (character-shape) features of each character. Word-level features from an external lexicon are incorporated to strengthen the representation of word and entity boundaries, and sentence-level features are added to capture global semantics. A cross-attention mechanism iteratively fuses these multi-granularity features into embeddings enriched with deep semantic information, and a conditional random field (CRF) layer finally outputs the entity recognition results. Experimental results on the CCKS2017 and CCKS2019 datasets show that the proposed model achieves F1 scores of 92.88% and 87.86%, respectively, outperforming current mainstream models.
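
The following is a minimal PyTorch sketch of the architecture described in the abstract; it is an illustration, not the authors' released implementation. The module names, feature dimensions, input shapes (e.g., 32×32 glyph images and fixed-length stroke sequences), the `bert-base-chinese` checkpoint, and the `pytorch-crf` package are all assumptions made for this sketch, and the paper's actual feature-extraction and fusion details may differ.

```python
# Illustrative sketch only: multi-granularity Chinese medical NER with
# BERT character embeddings, CNN-based stroke/glyph features, lexicon word
# features, sentence-level features, cross-attention fusion, and a CRF decoder.
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf


class MultiGranularityNER(nn.Module):
    def __init__(self, num_tags: int, hidden: int = 768, lexicon_dim: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")  # assumed checkpoint
        # 1-D CNN over per-character stroke sequences (stroke features).
        self.stroke_cnn = nn.Conv1d(in_channels=32, out_channels=hidden, kernel_size=3, padding=1)
        # 2-D CNN over per-character glyph images (character-shape features).
        self.glyph_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 4 * 4, hidden),
        )
        # Project lexicon (word-level) features to the shared hidden size.
        self.word_proj = nn.Linear(lexicon_dim, hidden)
        # Cross-attention: character embeddings attend to the other granularities.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, stroke_feats, glyph_imgs, word_feats, tags=None):
        # Assumed input shapes: stroke_feats (B, L, 32, S), glyph_imgs (B, L, 32, 32),
        # word_feats (B, L, lexicon_dim); B = batch size, L = sequence length.
        B, L = input_ids.shape
        char_emb = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, L, H)
        # Sentence-level feature: mean-pooled character embeddings, broadcast to every position.
        sent_emb = char_emb.mean(dim=1, keepdim=True).expand(-1, L, -1)
        stroke_emb = self.stroke_cnn(stroke_feats.view(B * L, 32, -1)).mean(-1).view(B, L, -1)
        glyph_emb = self.glyph_cnn(glyph_imgs.view(B * L, 1, 32, 32)).view(B, L, -1)
        word_emb = self.word_proj(word_feats)
        # Fusion: character embeddings as queries, concatenated multi-granularity
        # features as keys/values (one plausible realization of the cross-attention step).
        context = torch.cat([stroke_emb, glyph_emb, word_emb, sent_emb], dim=1)
        fused, _ = self.cross_attn(char_emb, context, context)
        emissions = self.classifier(fused)
        mask = attention_mask.bool()
        if tags is not None:  # training: negative log-likelihood from the CRF
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag path per sentence
```

In this sketch the cross-attention step uses the character embeddings as queries and the concatenated stroke, glyph, word, and sentence features as keys and values, which is one plausible way to realize the iterative multi-granularity fusion the abstract describes; the CRF layer then decodes the fused emissions into entity labels.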