Complex Entity Relation Extraction Method for Copper-Based Composite Material Literatures

doi:10.19678/j.issn.1000-3428.0069745

Abstract:

Effectively extracting entities and relations from the copper-based composite material literature is important for constructing materials knowledge graphs and advancing materials science research. Entities in this domain have complex structures, such as nested and discontinuous entities, and Single Entity Overlap (SEO) relations are common, so existing entity and relation extraction techniques are difficult to apply directly. To address this, the study constructs an entity relation extraction dataset for copper-based composite materials and proposes a two-stage extraction method. The first stage combines an inter-word relation classification task with a Bidirectional Gated Recurrent Unit (BiGRU) and multi-granularity dilated convolution to improve the entity recognition model's ability to identify entity spans. The second stage marks entity information in the text sequence and introduces an entity type attention mechanism into the relation classification model, so that a multi-feature representation strengthens relation classification. Experiments on three public datasets (Matscholar, SOFC, and MSP) and the self-built CBCM-IE dataset show that, compared with the baseline methods, the proposed method improves precision, recall, and F1 score by 5.91, 3.56, and 3.63 percentage points on average, respectively.
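To make the two stages more concrete, the following is a minimal sketch of the data representations they could operate on. It is not the authors' implementation. The label names `NNW` and `THW-<type>`, the marker strings `<S:...>` and `<O:...>`, the entity types `MATERIAL`, `ELEMENT`, and `PROPERTY`, and the example sentence are all assumptions made for illustration, and the marker function assumes contiguous, non-overlapping spans.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]
Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def word_pair_labels(entity_tokens: List[int], entity_type: str) -> Dict[Pair, str]:
    """Stage 1 view: encode one entity as relations between word pairs.

    entity_tokens lists the token indices of the entity in reading order.
    Skipped indices express a discontinuous entity.
    """
    labels: Dict[Pair, str] = {}
    # Link each word of the entity to the next word of the same entity.
    for left, right in zip(entity_tokens, entity_tokens[1:]):
        labels[(left, right)] = "NNW"
    # Link the last word back to the first; this pair carries the entity type.
    labels[(entity_tokens[-1], entity_tokens[0])] = f"THW-{entity_type}"
    return labels


def mark_entities(tokens: List[str], head: Span, tail: Span) -> List[str]:
    """Stage 2 view: wrap the head and tail entity of one candidate pair
    in typed markers (hypothetical marker format)."""
    opens = {head[0]: f"<S:{head[2]}>", tail[0]: f"<O:{tail[2]}>"}
    closes = {head[1]: f"</S:{head[2]}>", tail[1]: f"</O:{tail[2]}>"}
    marked: List[str] = []
    for i, token in enumerate(tokens):
        if i in closes:  # a span that ends just before token i
            marked.append(closes[i])
        if i in opens:  # a span that starts at token i
            marked.append(opens[i])
        marked.append(token)
    if len(tokens) in closes:  # a span that runs to the end of the sentence
        marked.append(closes[len(tokens)])
    return marked


# Hypothetical sentence, invented for this sketch.
tokens = "CNT and graphene reinforced Cu composites show high hardness and conductivity".split()

# Discontinuous entity "CNT ... reinforced Cu composites" and nested entity "Cu".
discontinuous = word_pair_labels([0, 3, 4, 5], "MATERIAL")
nested = word_pair_labels([4], "ELEMENT")

# SEO: one material entity takes part in two relations, one per property.
material: Span = (2, 6, "MATERIAL")
pair_hardness = mark_entities(tokens, material, (8, 9, "PROPERTY"))
pair_conductivity = mark_entities(tokens, material, (10, 11, "PROPERTY"))
```

In this representation the nested entity and the discontinuous entity occupy different cells of the word-pair grid, so both can be labeled at once. Under SEO, the shared entity is simply marked again in a separate copy of the sentence for each candidate pair.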

Key words: named entity recognition, relation extraction, pretrained language model, copper-based composite material

GUO Huayi, YOU Jinguo, GENG Qiqi, TAO Jingmei, YI Jianhong. Complex Entity Relation Extraction Method for Copper-Based Composite Material Literatures[J]. Computer Engineering, 2025, 51(11): 100-111.

https://www.ecice06.com/CN/Y2025/V51/I11/100

Figures/Tables 14

Fig.1 Two-stage method for extracting entity relations from the copper-based composite material literature

Fig.2 Structure of the entity recognition model

Fig.3 Structure of the relation classification model

Fig.4 Statistics of text token length in the dataset

Fig.5 Experimental results for selecting the input sequence length
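The last two captions concern token-length statistics and the choice of input sequence length. As a rough illustration of how such a choice is often informed, the sketch below computes, for several candidate maximum lengths, the fraction of samples that would fit without truncation. The function name, the candidate lengths, and the token counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List


def coverage_by_max_length(token_lengths: List[int], candidates: Iterable[int]) -> Dict[int, float]:
    """Return, for each candidate maximum length, the share of samples
    whose token count does not exceed it."""
    total = len(token_lengths)
    return {
        max_len: sum(1 for n in token_lengths if n <= max_len) / total
        for max_len in candidates
    }


# Hypothetical per-sample token counts, invented for this sketch.
lengths = [18, 42, 57, 64, 96, 120, 131, 200]
coverage = coverage_by_max_length(lengths, [64, 128, 256])
```

A shorter input length lowers memory and compute cost but truncates more samples, so coverage figures like these are usually weighed against task performance measured at each length.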

