
Computer Engineering ›› 2024, Vol. 50 ›› Issue (11): 119-129. doi: 10.19678/j.issn.1000-3428.0068258

• Artificial Intelligence and Pattern Recognition •

Chinese Named Entity Recognition Based on Wobert and Adversarial Learning

NI Yuan1,2, LIAO Shihao3,*, ZHANG Jian1,2

  1. College of Economics and Management, Beijing Information Science and Technology University, Beijing 100192, China
    2. Beijing Key Laboratory of Green Development Big Data Decision, Beijing 100192, China
    3. College of Computer Science, Beijing Information Science and Technology University, Beijing 100192, China
  • Received: 2023-08-17  Online: 2024-11-15  Published: 2024-03-05
  • Corresponding author: LIAO Shihao
  • Supported by: Young Scientist Project of the National Key Research and Development Program of China (2021YFF0900200)



Abstract:

Natural Language Processing (NLP) typically models Chinese Named Entity Recognition (NER) as a sequence labeling task that maps each character in the text to a label. Because each character is relatively independent and carries limited information, adding lexical information to NER can compensate for the lack of connections between characters. Existing Chinese NER models, however, often require an additionally constructed vocabulary, extract lexical information in a cumbersome way, and struggle to fuse word-level and character-level embeddings because the two come from different sources. To address these problems, this study proposes ALWAE-BiLSTM-CRF, a Chinese NER model based on Wobert and adversarial learning. Unlike traditional pre-trained models, the Wobert pre-trained model segments the text into words during pre-training and therefore fully learns information at both the word and character levels. Accordingly, ALWAE-BiLSTM-CRF obtains character-level vectors from the Wobert pre-trained model and uses the Wobert tokenizer to retrieve the word vectors already present in the pre-trained model. A BiLSTM then captures the temporal information of both sequences, and a multi-head attention mechanism fuses the word-level information into the character vectors. Meanwhile, adversarial attacks generate adversarial examples to strengthen the model's generalization. Finally, a Conditional Random Field (CRF) layer constrains the output to obtain the best prediction sequence. Comparative and ablation experiments on the Resume dataset and on Porcelain, a self-built dataset in the porcelain domain, show that ALWAE-BiLSTM-CRF achieves F1 scores of 97.21% and 85.7%, respectively, demonstrating its effectiveness for the Chinese NER task.
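Two of the steps described above, the adversarial perturbation of the embeddings and the CRF decoding of the final tag sequence, can be sketched in plain Python. This is a minimal illustrative sketch, not the authors' implementation: the function names, the toy emission/transition scores, and the ε value are assumptions, and in the actual model the gradient would come from backpropagation while the scores would be learned parameters.

```python
import math

def fgm_perturb(grad, epsilon=1.0):
    """FGM-style adversarial perturbation: r = epsilon * g / ||g||2.

    Added to an embedding, r moves the input in the direction that most
    increases the loss, producing an adversarial example for training.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

def viterbi_decode(emissions, transitions):
    """Best tag sequence under a linear-chain CRF (Viterbi algorithm).

    emissions[t][j]   : score of tag j at position t (e.g., from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score of a path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j
            prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            bp.append(prev)
            new_score.append(score[prev] + transitions[prev][j] + emit[j])
        backpointers.append(bp)
        score = new_score
    best_tag = max(range(num_tags), key=lambda t: score[t])
    path = [best_tag]
    for bp in reversed(backpointers):  # trace the best path backwards
        path.append(bp[path[-1]])
    path.reverse()
    return path, score[best_tag]

# Toy demo: 3 positions, 2 tags; transitions reward staying on the same tag,
# so the CRF overrides the middle position's emission preference for tag 1.
path, best = viterbi_decode([[2, 1], [1, 2], [2, 1]], [[1, 0], [0, 1]])
print(path, best)               # -> [0, 0, 0] 7
print(fgm_perturb([3.0, 4.0]))  # -> [0.6, 0.8]
```

The demo shows why the CRF layer "constrains the results": even though tag 1 has the higher emission score at the middle position, the transition scores make the consistent sequence [0, 0, 0] the global optimum.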

Key words: deep learning, Named Entity Recognition (NER), attention mechanism, feature fusion, Conditional Random Field (CRF)