
Computer Engineering ›› 2026, Vol. 52 ›› Issue (4): 131-139. doi: 10.19678/j.issn.1000-3428.0070117

• Computational Intelligence and Pattern Recognition •

Document-level Relation Extraction Method Based on Data Augmentation and Dynamic Threshold

LIU Junping1,2,3, HUANG Yuwei1, HU Xinrong1, PENG Tao1, YAO Xun1, WANG Bangchao1, YANG Huali1, ZHU Qiang1,*

    1. School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, Hubei, China
    2. Engineering Research Center of Hubei Province for Clothing Information, Wuhan 430200, Hubei, China
    3. Hubei Provincial Engineering Research Center for Intelligent Textile and Fashion, Wuhan 430200, Hubei, China
  • Received: 2024-07-15 Revised: 2024-10-09 Online: 2026-04-15 Published: 2026-04-08
  • Contact: ZHU Qiang

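  • About the authors:

    LIU Junping (CCF member), male, associate professor; main research interests: natural language processing, information retrieval, and industrial big data processing

    HUANG Yuwei, master's student

    HU Xinrong, professor

    PENG Tao, professor

    YAO Xun, lecturer

    WANG Bangchao, lecturer

    YANG Huali, lecturer

    ZHU Qiang (corresponding author), lecturer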

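  • Supported by:
    General Project of Humanities and Social Sciences Research of the Ministry of Education (23YJAZH082); Key Project of Hubei Provincial Education Science Planning (2022GA046); Young Scientists Fund of the National Natural Science Foundation of China (62102291); Young Scientists Fund of the National Natural Science Foundation of China (62307029); Hubei Provincial Natural Science Foundation Program (2024AFB736); Natural Science Foundation of Hubei Province (2025AFC097)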

Abstract:

Relation Extraction (RE) tasks in the biomedical field often suffer from data scarcity, class imbalance, and multi-label instances. To address these issues, a method that combines data augmentation with a dynamic threshold strategy is proposed. First, a GPT model is fine-tuned with a custom loss function, and feature templates obtained from a Word2Vec model are used to generate new data. Second, a BERT classifier screens the generated data, and the high-quality samples are combined with the original dataset to form a richer training set. Finally, a learnable dynamic threshold strategy is proposed that adjusts the classification threshold according to document length and the difference between the model output and the ground-truth labels, enabling the model to handle multi-label documents flexibly. Experimental results on two publicly available medical datasets show that the method achieves F1 values of 84.1% and 69.3%, which are 1.6 and 1.1 percentage points higher than those of the ATLOP method, respectively, verifying its effectiveness.
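
The abstract does not give the formula for the dynamic threshold, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a threshold of the form sigmoid(w * log(document length) + b), where `w`, `b`, the learning rate `lr`, and the class name `LengthAwareThreshold` are all hypothetical. The update is a simple perceptron-style heuristic, not the paper's loss: it raises the threshold when the model predicts more labels than the ground truth contains and lowers it when it predicts fewer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LengthAwareThreshold:
    """Hypothetical threshold: tau = sigmoid(w * log(doc_len) + b)."""

    def __init__(self, w=0.0, b=0.0, lr=0.1):
        self.w = w    # assumed weight on log document length
        self.b = b    # assumed bias; w = b = 0 gives tau = 0.5
        self.lr = lr  # assumed step size for the heuristic update

    def threshold(self, doc_len):
        return sigmoid(self.w * math.log(doc_len) + self.b)

    def predict(self, probs, doc_len):
        # A relation label is predicted when its probability exceeds tau.
        tau = self.threshold(doc_len)
        return [1 if p > tau else 0 for p in probs]

    def update(self, probs, labels, doc_len):
        # Mean signed difference between predictions and ground truth:
        # positive means over-prediction, negative means under-prediction.
        preds = self.predict(probs, doc_len)
        error = sum(p - y for p, y in zip(preds, labels)) / len(labels)
        # Move the threshold in the direction that reduces that difference.
        self.w += self.lr * error * math.log(doc_len)
        self.b += self.lr * error


# Usage: one entity pair in a 100-token document, three candidate relations.
thr = LengthAwareThreshold()
probs, labels = [0.9, 0.6, 0.2], [1, 0, 0]
print(thr.predict(probs, doc_len=100))  # [1, 1, 0] (one false positive)
thr.update(probs, labels, doc_len=100)
print(thr.predict(probs, doc_len=100))  # [1, 0, 0]
```

In a real multi-label DocRE model the threshold parameters would more plausibly be trained jointly with the classifier by gradient descent; the hand-written update here only illustrates how length and prediction error can both influence the decision boundary.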

Key words: Document-level Relation Extraction (DocRE), data augmentation, dynamic threshold, class imbalance, GPT model
