Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 337-344. doi: 10.19678/j.issn.1000-3428.0067285

• Development Research and Engineering Application •

Chinese Medical Named Entity Recognition Based on Multi-Granularity Glyph Enhancement

Wei LIU1, Lei MA1,*, Kai LI2, Rong LI3

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
  2. Information Department of the First People's Hospital of Yunnan Province, Kunming 650500, Yunnan, China
  3. Scientific Research Department of the First People's Hospital of Yunnan Province, Kunming 650500, Yunnan, China
  • Received: 2023-03-27   Online: 2024-02-15   Published: 2023-07-04
  • Contact: Lei MA

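  • Supported by:
    National Natural Science Foundation of China (62266025); Yunnan Provincial Major Science and Technology Special Program (202202AD080004); Yunnan Provincial Major Science and Technology Special Program (202202AE090008); Yunnan Fundamental Research Program (Kunming Medical University Joint Special Project) (202201AY070001-258)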

Abstract:

Chinese Medical Named Entity Recognition (CMNER) aims to extract entities from unstructured Chinese medical text. Existing character-based CMNER models do not consider the characteristics of Chinese characters comprehensively from multiple perspectives, which limits their performance on CMNER. To address this, a CMNER model based on multi-granularity glyph enhancement is proposed. For each input sentence, the model combines representations of the glyph spatial structure and the radicals of Chinese characters, and matches each character against a domain lexicon to obtain domain word information, enriching the semantic and potential boundary information of each character. A gating mechanism then integrates the domain words with the multi-granularity glyph features, so that both the domain information and the underlying character-level information are taken into account, which improves the model's ability to recognize medical entities. The resulting multi-granularity glyph-enhanced character representations are fed into a Bidirectional Long Short-Term Memory (BiLSTM) layer for contextual encoding and a Conditional Random Field (CRF) layer for label decoding. Experimental results show that the proposed model improves the F1 score over the best baseline model by 1.04% and 0.62% on the IMCS21 and CMeEE datasets, respectively. Ablation experiments further verify the effectiveness of each component of the model in recognizing Chinese medical named entities.

Key words: named entity recognition, medical domain, glyph structure, gating mechanism, domain lexicon
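
The abstract does not give the exact form of the gating mechanism, so the following is only a minimal sketch of one common formulation, in which a sigmoid gate decides element-wise how much of the glyph feature and how much of the lexicon word feature enters the fused character representation. The function name `gated_fusion`, its parameters, and the specific gate equation are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(
    glyph_vec: Sequence[float],
    word_vec: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Blend two equal-length feature vectors with an element-wise gate.

    Assumed (hypothetical) formulation:
        gate  = sigmoid(W . [glyph; word] + b)
        fused = gate * glyph + (1 - gate) * word
    `weights` has one row per output dimension and 2 * d columns.
    """
    if len(glyph_vec) != len(word_vec):
        raise ValueError("glyph and word vectors must have the same length")

    concat = list(glyph_vec) + list(word_vec)
    fused = []
    for i, (g, w) in enumerate(zip(glyph_vec, word_vec)):
        # Linear score for dimension i over the concatenated features.
        z = sum(w_ij * x_j for w_ij, x_j in zip(weights[i], concat)) + bias[i]
        gate = sigmoid(z)
        # Gate near 1 keeps the glyph feature; gate near 0 keeps the word feature.
        fused.append(gate * g + (1.0 - gate) * w)
    return fused


if __name__ == "__main__":
    glyph = [1.0, 2.0]
    word = [3.0, 4.0]
    zero_w = [[0.0] * 4, [0.0] * 4]
    # With zero weights and bias the gate is 0.5, so the result is the average.
    print(gated_fusion(glyph, word, zero_w, [0.0, 0.0]))
```

In the pipeline the abstract describes, a fused vector like this would be produced for every character and then passed to the BiLSTM and CRF layers; those layers are omitted here.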
