
Computer Engineering ›› 2022, Vol. 48 ›› Issue (6): 89-94,106. doi: 10.19678/j.issn.1000-3428.0061630

• Artificial Intelligence and Pattern Recognition •

  • About the authors: LI Junhuai (born 1969), male, professor, Ph.D.; his research interests include activity recognition, cloud computing, and big data. CHEN Miaomiao, M.S. candidate; WANG Huaijun, associate professor, Ph.D.; CUI Ying'an, lecturer, Ph.D.; ZHANG Aihua, engineer.
  • Funding:
    National Key Research and Development Program of China (2018YFB1703000); Fund of the Shaanxi Provincial Water Resources Department (2020slkj-17).

Chinese Named Entity Recognition Method Based on ALBERT-BGRU-CRF

LI Junhuai1, CHEN Miaomiao1, WANG Huaijun1, CUI Ying'an1, ZHANG Aihua2   

  1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China;
    2. Sapa Chalco Aluminium Products (Chongqing) Co., Ltd., Chongqing 401326, China
  • Received: 2021-05-12  Revised: 2021-07-26  Published: 2021-08-11



Abstract: Named Entity Recognition (NER) is an important basis for upper-level natural language processing tasks such as knowledge graph construction, search engines, and recommendation systems. Chinese NER labels and classifies proper nouns or specific named entities in a text sequence. To address the inability of existing Chinese NER methods to effectively extract long-distance semantic information and to handle polysemy, this study proposes a Chinese NER method based on the ALBERT pre-trained language model, a Bidirectional Gated Recurrent Unit (BGRU), and a Conditional Random Field (CRF), called the ALBERT-BGRU-CRF model. First, the ALBERT pre-trained language model performs word embedding on the input text to obtain dynamic word vectors, which effectively resolves the polysemy problem. Second, the BGRU extracts contextual semantic features to further capture the semantics between long-distance words. Finally, the concatenated vectors are fed into the CRF layer and decoded with the Viterbi algorithm, reducing the probability of outputting incorrect labels. The entity annotations are thus obtained, completing the Chinese NER task. The experimental results show that the precision and recall of the ALBERT-BGRU-CRF model on the MSRA corpus reach 95.16% and 94.58%, respectively, and that its F1 value is 4.43 and 3.78 percentage points higher than that of the fragment neural network model and the CNN-BiLSTM-CRF model, respectively.
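The final decoding step described in the abstract can be illustrated with a minimal pure-Python sketch of Viterbi decoding over per-token tag scores and a tag-to-tag transition matrix, as performed by the CRF layer. The tag set, emission scores, and transition values below are illustrative assumptions, not values from the paper:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for one sentence.

    emissions: list of per-token score lists, emissions[t][j] = score of
               tag j at token t (in the paper, produced by the BGRU layer).
    transitions: transitions[i][j] = score of moving from tag i to tag j
                 (in the paper, learned by the CRF layer).
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])          # best score ending in each tag so far
    backptrs = []                       # best previous tag for each position
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(num_tags):
            # Best previous tag i to transition into tag j.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    # Backtrack from the best final tag.
    last = max(range(num_tags), key=lambda j: score[j])
    path = [last]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


# Hypothetical 3-tag scheme: 0 = O, 1 = B, 2 = I.
# The transition matrix penalizes the illegal move O -> I, so the decoder
# prefers a well-formed B-I-O sequence even when token scores are ambiguous.
transitions = [
    [0, 0, -10],   # from O
    [0, 0, 1],     # from B
    [0, 0, 1],     # from I
]
emissions = [
    [0, 5, 0],     # token 0: strongly B
    [1, 0, 1],     # token 1: O and I equally likely
    [3, 0, 0],     # token 2: strongly O
]
print(viterbi_decode(emissions, transitions))  # -> [1, 2, 0], i.e. B, I, O
```

Note how the transition scores, not the per-token scores alone, resolve the ambiguous middle token to I: this is exactly the error-label suppression the abstract attributes to the CRF layer.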

Key words: Named Entity Recognition(NER), pre-trained language model, Bidirectional Gated Recurrent Unit(BGRU), Conditional Random Field(CRF), word vector, deep learning

CLC Number: