Document-level Relation Extraction Method Based on Data Augmentation and Dynamic Threshold

Computer Engineering, 2026, Vol. 52, Issue 4: 131-139. doi: 10.19678/j.issn.1000-3428.0070117

• Computational Intelligence and Pattern Recognition •

LIU Junping¹,²,³, HUANG Yuwei¹, HU Xinrong¹, PENG Tao¹, YAO Xun¹, WANG Bangchao¹, YANG Huali¹, ZHU Qiang¹,*

1. School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, Hubei, China
2. Engineering Research Center of Hubei Province for Clothing Information, Wuhan 430200, Hubei, China
3. Hubei Provincial Engineering Research Center for Intelligent Textile and Fashion, Wuhan 430200, Hubei, China

Received: 2024-07-15 Revised: 2024-10-09 Online: 2026-04-15 Published: 2026-04-08
Contact: ZHU Qiang

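About the authors:
LIU Junping (CCF member), male, associate professor; main research interests: natural language processing, information retrieval, and industrial big data processing
HUANG Yuwei, master's student
HU Xinrong, professor
PENG Tao, professor
YAO Xun, lecturer
WANG Bangchao, lecturer
YANG Huali, lecturer
ZHU Qiang (corresponding author), lecturer
Funding:
General Project of Humanities and Social Sciences Research, Ministry of Education (23YJAZH082); Key Project of Hubei Provincial Education Science Planning (2022GA046); National Natural Science Foundation of China, Young Scientists Fund (62102291); National Natural Science Foundation of China, Young Scientists Fund (62307029); Natural Science Foundation of Hubei Province Program (2024AFB736); Natural Science Foundation of Hubei Province (2025AFC097)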

Abstract:

Relation Extraction (RE) tasks in the biomedical field often suffer from data scarcity, class imbalance, and multi-label instances. To address these issues, this paper proposes a method that combines data augmentation with a dynamic threshold strategy. First, a GPT model is fine-tuned with a custom loss function, and new data are generated from feature templates obtained with a Word2Vec model. Second, a BERT classifier screens the generated data, and the high-quality samples are merged with the original dataset to form a richer training set. Finally, a learnable dynamic threshold strategy adjusts the classification threshold according to document length and the difference between the model output and the true labels, enabling the model to handle multi-label documents flexibly. Experiments on two public medical datasets show that the method achieves F1 scores of 84.1% and 69.3%, which are 1.6 and 1.1 percentage points higher, respectively, than those of the ATLOP method, verifying its effectiveness.

Key words: Document-level Relation Extraction (DocRE), data augmentation, dynamic threshold, class imbalance, GPT model
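
The abstract does not give the formula for the dynamic threshold, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a sigmoid-shaped threshold that depends on the logarithm of document length and a simple error-driven update of one threshold parameter. The names `dynamic_threshold`, `predict_labels`, `update_base`, `base`, `length_weight`, and `ref_len` are all hypothetical, as is the choice of whether longer documents raise or lower the threshold.

```python
import math


def dynamic_threshold(doc_len, base, length_weight, ref_len=256):
    """Hypothetical length-aware threshold in (0, 1).

    At doc_len == ref_len the log term is zero, so the threshold is sigmoid(base).
    The sign of length_weight decides whether longer documents get a stricter
    or a looser threshold; this is an assumption, not taken from the paper.
    """
    logit = base + length_weight * math.log(doc_len / ref_len)
    return 1.0 / (1.0 + math.exp(-logit))


def predict_labels(probs, doc_len, base=0.0, length_weight=0.1):
    """Multi-label decision: keep every relation whose probability exceeds the threshold."""
    tau = dynamic_threshold(doc_len, base, length_weight)
    return [i for i, p in enumerate(probs) if p > tau]


def update_base(probs, gold, doc_len, base, length_weight, lr=0.1):
    """Assumed error-driven update: raise the threshold parameter when false
    positives outnumber false negatives, lower it in the opposite case."""
    tau = dynamic_threshold(doc_len, base, length_weight)
    false_pos = sum(1 for p, y in zip(probs, gold) if p > tau and y == 0)
    false_neg = sum(1 for p, y in zip(probs, gold) if p <= tau and y == 1)
    return base + lr * (false_pos - false_neg)
```

With these assumed defaults, a relation scored 0.52 would be kept for a 256-token document but dropped for a 1024-token one, which illustrates how a single classifier can apply different cut-offs to documents of different lengths.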

LIU Junping, HUANG Yuwei, HU Xinrong, PENG Tao, YAO Xun, WANG Bangchao, YANG Huali, ZHU Qiang. Document-level Relation Extraction Method Based on Data Augmentation and Dynamic Threshold[J]. Computer Engineering, 2026, 52(4): 131-139.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0070117

https://www.ecice06.com/EN/Y2026/V52/I4/131
