
Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 305-312. doi: 10.19678/j.issn.1000-3428.0065880

• Development Research and Engineering Application •

Biomedical Named Entity Recognition Method Based on Word Meaning Enhancement

Mengxuan CHEN1,2, Yanping CHEN1,2,*, Ying HU1,2, Ruizhang HUANG1,2, Yongbin QIN1,2   

  1. State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
    2. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Received:2022-09-29 Online:2023-10-15 Published:2023-01-06
  • Contact: Yanping CHEN

  • About the authors:

    Mengxuan CHEN (b. 1997), female, master's student; her main research interests are natural language processing and named entity recognition

    Ying HU, Ph.D. candidate

    Ruizhang HUANG, professor, Ph.D.

    Yongbin QIN, professor, Ph.D.

  • Funding:
    National Natural Science Foundation of China (62166007)

Abstract:

Biomedical Named Entity Recognition(BioNER), as a core task of biomedical text mining, provides strong support for downstream tasks. Compared with the general domain, biomedical data contain far more out-of-vocabulary words. Existing BioNER methods usually split such words into morphemes to alleviate the lack of representation information; however, the splitting also breaks up the internal information of words, so label inconsistency and cross-entity label problems easily arise when predicting labels for morphemes. In addition, segmenting words into morphemes lengthens sentences, which aggravates the vanishing gradient problem during training. To address these problems, a BioNER method that performs word meaning enhancement through a Bidirectional Long Short-Term Memory(BiLSTM)-Biaffine structure is proposed. First, morpheme representations are obtained from the BioBERT pre-trained model. Then, the BiLSTM-Biaffine module performs word meaning enhancement: at the word level, a BiLSTM captures the forward and backward sequence information of the morphemes, and a Biaffine attention mechanism strengthens their associated information and re-fuses them into word representations. Finally, the label sequence of the input sentence is obtained through a BiLSTM-CRF model. The experimental results show that on the BC2GM, NCBI-Disease, BC5CDR-chem, and JNLPBA datasets, the F1 scores of the method reach 84.94%, 89.07%, 92.14%, and 74.57%, respectively, an average improvement of 2.99, 1.84, 3.09, and 1.03 percentage points over mainstream sequence labeling models such as MTM-CW and MT-BioNER, verifying the effectiveness of the proposed method in BioNER tasks.

Key words: Biomedical Named Entity Recognition(BioNER), morpheme, word meaning enhancement, Bidirectional Long Short-Term Memory(BiLSTM) network, attention mechanism
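
The word meaning enhancement step summarized in the abstract lends itself to a brief illustration. The sketch below is not the authors' implementation; it only assumes one plausible reading of the abstract: the BioBERT subword (morpheme) vectors of a single word are passed through a word-level BiLSTM, a biaffine attention of the assumed form score = h_f^T U h_b + w^T[h_f; h_b] re-associates the forward and backward morpheme states, and the result is pooled into one enhanced word vector that a downstream BiLSTM-CRF tagger could consume. All class and variable names, the dimensions, and the mean-pooling step are hypothetical.

```python
# Minimal sketch of the word meaning enhancement idea described in the abstract.
# The biaffine form, dimensions, and pooling strategy are assumptions, not the
# authors' published implementation.
import torch
import torch.nn as nn


class BiaffineWordEnhancer(nn.Module):
    """Fuse the BioBERT morpheme (subword) vectors of one word into a single
    word representation via a word-level BiLSTM and a biaffine attention."""

    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        # Word-level BiLSTM: reads the morphemes of a word forwards and backwards.
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        # Biaffine scoring between forward and backward morpheme states:
        # score_ij = h_f_i^T U h_b_j + w^T [h_f_i ; h_b_j]   (assumed form)
        self.U = nn.Parameter(torch.empty(hidden, hidden))
        nn.init.xavier_uniform_(self.U)
        self.w = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, in_dim)

    def forward(self, morphemes: torch.Tensor) -> torch.Tensor:
        # morphemes: (num_subwords, in_dim) vectors of one word, e.g. from BioBERT.
        states, _ = self.bilstm(morphemes.unsqueeze(0))      # (1, n, 2*hidden)
        h_f, h_b = states[0].chunk(2, dim=-1)                # forward / backward halves
        # Pairwise biaffine scores between forward and backward states.
        bilinear = h_f @ self.U @ h_b.T                      # (n, n)
        n = h_f.size(0)
        pair = torch.cat([h_f.unsqueeze(1).expand(n, n, -1),
                          h_b.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = bilinear + self.w(pair).squeeze(-1)         # (n, n)
        attn = scores.softmax(dim=-1)
        # Re-associate backward context with each forward state, then pool over
        # the morphemes to obtain one enhanced word vector.
        fused = torch.cat([h_f, attn @ h_b], dim=-1)         # (n, 2*hidden)
        return self.out(fused.mean(dim=0))                   # (in_dim,)


# Usage: four subword vectors of one word -> one enhanced word vector that a
# downstream BiLSTM-CRF tagger could consume.
word_vec = BiaffineWordEnhancer()(torch.randn(4, 768))
print(word_vec.shape)  # torch.Size([768])
```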
