Entity Recognition of Industry Figures Based on Character and Word Fusion and Adversarial Training

doi:10.19678/j.issn.1000-3428.0064584

Abstract

Named Entity Recognition (NER) for industry figures extracts useful entity information from corpora about industry figures, and is a fundamental and critical task for mining these information resources in depth. Mainstream NER models do not make full use of word-level features, so they identify semantics and entity boundaries inaccurately when recognizing the distinctive entities found in industry-figure texts. This paper proposes an industry-figure entity recognition model based on character and word fusion and adversarial training. First, the RoBERTa-wwm-ext pre-trained model extracts the character features of a sentence, and a lexicon is fused in to construct the word features of the sentence. Then, adversarial samples are generated by adding perturbations to the fused character-word vector representation. The fused vector representation and the adversarial samples are fed as training data into a Bidirectional Long Short-Term Memory (BiLSTM) network to learn contextual information, and a Conditional Random Field (CRF) infers the optimal label sequence. A named entity annotation scheme is designed according to the characteristics of industry-figure texts, and a dataset is built for experimental validation. On the test set, the model achieves a precision, recall, and F1 score of 92.94%, 94.35%, and 93.64%, which are 3.68, 1.24, and 2.39 percentage points higher, respectively, than those of the BERT-BiLSTM-CRF model.

Key words: Named Entity Recognition (NER), industry figures, character and word fusion, adversarial training, pre-trained model
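The adversarial step described in the abstract (adding a perturbation to the fused character-word representation) can be illustrated with a small sketch. The abstract does not state how the features are fused or how the perturbation is computed, so two assumptions are made here: fusion is plain concatenation, and the perturbation follows the Fast Gradient Method form r_adv = epsilon * g / ||g||_2 from the adversarial training literature. The function names, the toy vectors, and the gradient values are all hypothetical; in a real model the gradient would come from backpropagating the BiLSTM-CRF loss.

```python
import math

def fuse(char_vec, word_vec):
    """Fuse character and word features by concatenation (an assumed scheme)."""
    return list(char_vec) + list(word_vec)

def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * g / ||g||_2; a zero gradient yields a zero perturbation."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

def adversarial_example(embedding, grad, epsilon=1.0):
    """Add the FGM-style perturbation to an embedding to get an adversarial sample."""
    r_adv = fgm_perturbation(grad, epsilon)
    return [e + r for e, r in zip(embedding, r_adv)]

# Toy inputs (hypothetical values, not taken from the paper).
char_vec = [0.2, -0.1]        # stand-in for a character feature vector
word_vec = [0.5, 0.0]         # stand-in for a lexicon-derived word feature vector
fused = fuse(char_vec, word_vec)

# Pretend gradient of the training loss with respect to the fused vector.
grad = [3.0, 0.0, 4.0, 0.0]
adv = adversarial_example(fused, grad, epsilon=0.5)

print(fused)  # the clean fused representation
print(adv)    # approximately [0.5, -0.1, 0.9, 0.0]
```

In this setup both `fused` and `adv` would be passed to the sequence labeler during training, which matches the abstract's statement that the fused representation and the adversarial samples together form the training input. The scale `epsilon` is a tunable hyperparameter in this sketch.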

CLC number: TP18

ZHU Hong, NIU Haoran, ZHU Tong. Entity Recognition of Industry Figures Based on Character and Word Fusion and Adversarial Training[J]. Computer Engineering, 2023, 49(5): 56-62.

https://www.ecice06.com/CN/Y2023/V49/I5/56

Figures/Tables: 8

