
计算机工程 (Computer Engineering), 2024, Vol. 50, Issue (10): 145-153. doi: 10.19678/j.issn.1000-3428.0068226

• Artificial Intelligence and Pattern Recognition •

Chinese Named Entity Recognition Based on Lexicon Fusion and Dependency Relation

TANG Zhuoran, LIU Yi*

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
  • Received: 2023-08-16    Online: 2024-10-15    Published: 2024-01-25
  • Contact: LIU Yi
  • Supported by: Key-Area Research and Development Program of Guangdong Province (2021B0101200002)

Abstract:

Named entity recognition is a fundamental task in natural language processing that provides valuable data support for many downstream tasks, such as relation extraction and knowledge graph construction. Chinese named entity recognition is hampered by word segmentation errors, ambiguous entity boundaries, and contextual dependencies, and existing methods neither fully exploit lexical information nor effectively extract the internal features of a text. To address these problems, this paper proposes a Chinese named entity recognition model based on lexicon fusion and dependency relations. First, the self-matching words of each character in the input text are retrieved to generate lexical feature vectors, and word boundary information is derived from the position of each character within its self-matching words. The character vectors and lexical feature vectors are then fused with a biaffine attention mechanism, so that the lexical and word boundary information is integrated into the model's encoding process and strengthens its entity recognition ability. Next, a dependency graph of the input text is built from its dependency syntax, and a Graph Attention Network (GAT) captures the dependency features within the text, enriching its internal semantic dependency information and helping to distinguish entity boundaries. Finally, a Conditional Random Field (CRF) computes the label sequence of the text. Experimental results show that the proposed model achieves F1 scores of 92.10%, 80.76%, and 95.66% on the CCKS2017, OntoNotes 4.0, and MSRA datasets, respectively, outperforming the comparison models.
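The first step, retrieving each character's self-matching words and its position inside them (the word boundary information), can be illustrated with a short sketch. This is a minimal illustration rather than the paper's implementation: the toy lexicon, the BMES-style position tags, and the `max_len` window are assumptions for demonstration; a real system would typically match against a large pretrained word list, for example via a trie.

```python
# Collect each character's "self-matching words" from an external lexicon and
# record the character's position (B/M/E/S) inside each matched word.
from collections import defaultdict

def self_matching_words(sentence: str, lexicon: set[str], max_len: int = 5):
    """Return {char_index: [(word, position_tag), ...]} for every character."""
    matches = defaultdict(list)
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            for i in range(start, end):
                if end - start == 1:
                    tag = "S"          # single-character word
                elif i == start:
                    tag = "B"          # word beginning
                elif i == end - 1:
                    tag = "E"          # word end
                else:
                    tag = "M"          # word middle
                matches[i].append((word, tag))
    return matches

# Illustrative lexicon and sentence (not from the paper's datasets).
lexicon = {"广州", "广州市", "市长", "长江", "长江大桥", "大桥"}
print(self_matching_words("广州市长江大桥", lexicon))
```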
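Fusing character vectors with the lexical feature vectors through a biaffine attention score might look roughly like the following PyTorch sketch. It is a simplified stand-in for the paper's module: the dimensions, the scoring form s = c^T U w + V[c; w] + b, the masking of padded words, and the final projection are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact implementation) of biaffine attention
# fusion between character representations and self-matching-word features.
import torch
import torch.nn as nn


class BiaffineLexiconFusion(nn.Module):
    """Fuse each character vector with its self-matching word vectors."""

    def __init__(self, char_dim: int, word_dim: int, out_dim: int):
        super().__init__()
        # Biaffine scoring: s = c^T U w + V [c; w] + b
        self.U = nn.Parameter(torch.empty(char_dim, word_dim))
        nn.init.xavier_uniform_(self.U)
        self.V = nn.Linear(char_dim + word_dim, 1, bias=True)
        self.proj = nn.Linear(char_dim + word_dim, out_dim)

    def forward(self, char_vec, word_vecs, word_mask):
        # char_vec:  (batch, seq_len, char_dim)
        # word_vecs: (batch, seq_len, n_words, word_dim) self-matching words per character
        # word_mask: (batch, seq_len, n_words)           1 for real words, 0 for padding
        bilinear = torch.einsum("bsc,cw,bsnw->bsn", char_vec, self.U, word_vecs)
        concat = torch.cat(
            [char_vec.unsqueeze(2).expand(-1, -1, word_vecs.size(2), -1), word_vecs], dim=-1
        )
        scores = bilinear + self.V(concat).squeeze(-1)        # (batch, seq_len, n_words)
        scores = scores.masked_fill(word_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        attn = torch.nan_to_num(attn)                         # characters with no matched word
        lex = torch.einsum("bsn,bsnw->bsw", attn, word_vecs)  # weighted lexicon feature
        return self.proj(torch.cat([char_vec, lex], dim=-1))  # fused representation


# Usage with arbitrary shapes (character vectors would come from the encoder).
fusion = BiaffineLexiconFusion(char_dim=768, word_dim=50, out_dim=256)
chars = torch.randn(2, 10, 768)
words = torch.randn(2, 10, 4, 50)   # up to 4 self-matching words per character
mask = torch.ones(2, 10, 4)
fused = fusion(chars, words, mask)  # (2, 10, 256)
```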
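Building the dependency graph from parser output and applying graph attention only along dependency edges could be sketched as below. The single-head GAT layer, the symmetric adjacency with self-loops, and the example sentence and head indices are illustrative assumptions; the paper's GAT configuration (e.g. number of heads and layers) may differ.

```python
# A minimal sketch of graph attention over a dependency graph built from
# the head indices produced by a dependency parser.
import torch
import torch.nn as nn
import torch.nn.functional as F


def dependency_adjacency(heads: list[int]) -> torch.Tensor:
    """Symmetric adjacency matrix (with self-loops) from dependency heads.
    heads[i] is the index of token i's head, or -1 for the root."""
    n = len(heads)
    adj = torch.eye(n)
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i, h] = adj[h, i] = 1.0
    return adj


class GATLayer(nn.Module):
    """Single-head graph attention layer restricted to dependency edges."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (n_tokens, in_dim), adj: (n_tokens, n_tokens)
        h = self.W(x)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float("-inf"))  # attend only along dependency edges
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)                     # aggregated neighbourhood features


# Example: "广东工业大学 位于 广州" with token 1 ("位于") as the root.
heads = [1, -1, 1]
adj = dependency_adjacency(heads)
tokens = torch.randn(3, 256)              # fused character/lexicon features
out = GATLayer(256, 256)(tokens, adj)     # (3, 256)
```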
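For the final CRF labelling step, inference scores tag sequences with emission and transition scores and decodes the best sequence with the Viterbi algorithm. The sketch below shows only this decoding step with random, untrained scores and a made-up tag set; a real model would learn the scores jointly with the encoder via the CRF negative log-likelihood loss, which is not shown here.

```python
# A minimal Viterbi-decoding sketch for CRF label prediction.
import torch


def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor) -> list[int]:
    """emissions: (seq_len, n_tags) per-token tag scores from the encoder.
    transitions: (n_tags, n_tags) score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0]                   # best score ending in each tag at step 0
    backpointers = []
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j], maximised over previous tag i
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]


# Example with a tiny hypothetical tag set {O, B-ORG, I-ORG} and random scores.
tags = viterbi_decode(torch.randn(5, 3), torch.randn(3, 3))
print(tags)  # e.g. [1, 2, 2, 0, 0]
```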

Key words: attention mechanism, dependency relation, lexicon fusion, Graph Attention Network (GAT), Chinese named entity recognition