
计算机工程 ›› 2024, Vol. 50 ›› Issue (7): 32-41. doi: 10.19678/j.issn.1000-3428.0068625

• 智慧教育 •

基于大语言模型的教育文本幂等摘要方法

杨兴睿1, 马斌2,*, 李森垚1,*, 钟忺1,2

  1. 武汉理工大学计算机与人工智能学院, 湖北 武汉 430070
    2. 武汉理工大学信息化办公室, 湖北 武汉 430070
  • 收稿日期:2023-10-19 出版日期:2024-07-15 发布日期:2024-07-24
  • 通讯作者: 马斌, 李森垚
  • 基金资助:
    国家自然科学基金(62271361)

Large Language Model-based Idempotent Summarization Method for Educational Text

Xingrui YANG1, Bin MA2,*, Senyao LI1,*, Xian ZHONG1,2

  1. School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, Hubei, China
    2. Informatization Office, Wuhan University of Technology, Wuhan 430070, Hubei, China
  • Received: 2023-10-19 Online: 2024-07-15 Published: 2024-07-24
  • Contact: Bin MA, Senyao LI

摘要:

大语言模型在自然语言处理领域蓬勃发展, 但在教育数字化领域应用过程中仍面临一系列重要挑战。针对教育数字化领域垂域数据稀缺、摘要长度不稳定导致信息缺失或冗余的问题, 提出一种用于教育领域文本摘要的轻量化幂等模型框架IGLM。该模型首先采用多源训练进行自适应扩增以提升数据多样性, 然后对下游的文本摘要任务进行多种微调。同时, 为降低文本长度的影响, 设计幂等摘要生成策略拉近初次摘要与幂等摘要来约束模型, 减少语料分布不均导致的偏见, 结合量化技术在低资源条件下生成更为精确和流畅的摘要文本。实验以ROUGE分数为评估指标, 在公开中文文本摘要数据集LCSTS、EDUCATION、NLPCC上进行验证。实验结果表明, 该框架在生成摘要的准确率和流畅性上有明显提升, 其中ROUGE-1/2/L相较基线模型在LCSTS数据集上分别提升7.9、7.4、8.7个百分点, 在EDUCATION数据集上分别提升12.9、15.4、15.7个百分点, 在NLPCC数据集上分别提升12.2、11.7、12.7个百分点, 验证了模型有效性。

关键词: 教育数字化, 文本摘要, 大语言模型, 低资源场景, 幂等, 扩增
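
The idempotent summary generation strategy described in the abstract constrains the model so that summarizing a generated summary again changes it as little as possible, i.e., the summary behaves like a fixed point. The sketch below only illustrates that intuition under assumptions of ours: summarize is a hypothetical stand-in for the fine-tuned model, and the character-bigram overlap measure and the 0.3 threshold are placeholder choices, not the constraint actually used in IGLM.

def char_bigrams(text):
    """Character bigrams of a (Chinese or English) string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def overlap_f1(a, b):
    """Bigram-overlap F1 between two strings, in [0, 1]."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga or not gb:
        return 0.0
    hit = len(ga & gb)
    if hit == 0:
        return 0.0
    precision, recall = hit / len(gb), hit / len(ga)
    return 2 * precision * recall / (precision + recall)

def idempotency_gap(text, summarize):
    """Summarize once, then summarize the summary; a small gap means the
    first summary is already close to its idempotent fixed point."""
    first = summarize(text)    # initial summary
    second = summarize(first)  # summary of the summary
    return 1.0 - overlap_f1(first, second)

# Hypothetical usage: filter or penalise generations whose gap is too large.
# gap = idempotency_gap(document, summarize=finetuned_model_generate)
# if gap > 0.3:  # threshold is an arbitrary illustration, not from the paper
#     regenerate_or_downweight(document)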

Abstract:

Large Language Models (LLMs) are developing rapidly in Natural Language Processing (NLP), yet significant challenges remain in applying them to educational digitalization. To address the scarcity of domain-specific data in this field and the unstable summary length that leads to missing or redundant information, this study proposes a lightweight idempotent model framework, the Idempotent Generative Language Model (IGLM), for educational text summarization. The model first employs multi-source training for adaptive augmentation to enhance data diversity, and then applies several fine-tuning procedures to the downstream text summarization task. To reduce the influence of text length, an idempotent summary generation strategy is designed that constrains the model by pulling the initial summary toward its idempotent summary, mitigating the bias caused by uneven corpus distribution; combined with quantization techniques, the framework generates more accurate and fluent summaries under low-resource conditions. Experiments use Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores as the evaluation metric on the publicly available Chinese text summarization datasets Large-scale Chinese Short Text Summarization (LCSTS), EDUCATION, and Natural Language Processing and Chinese Computing (NLPCC). The results show clear gains in the accuracy and fluency of the generated summaries: compared with the baseline model, ROUGE-1/2/L improve by 7.9, 7.4, and 8.7 percentage points on LCSTS, by 12.9, 15.4, and 15.7 percentage points on EDUCATION, and by 12.2, 11.7, and 12.7 percentage points on NLPCC, confirming the effectiveness of the model.

Key words: educational digitalization, text summarization, Large Language Model (LLM), low-resource scenarios, idempotent, augmentation
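
Both abstracts report results as ROUGE-1/2/L improvements on LCSTS, EDUCATION, and NLPCC. As a reference point only, the following is a minimal character-level sketch of ROUGE-1/2/L F1; it is an illustrative approximation under our own assumptions (the example strings are invented, and the paper's evaluation pipeline may tokenize and aggregate differently).

from collections import Counter

def ngram_f1(cand, ref, n):
    """ROUGE-N style F1 over character n-grams (clipped counts)."""
    c = Counter(cand[i:i + n] for i in range(len(cand) - n + 1))
    r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def lcs_f1(cand, ref):
    """ROUGE-L style F1 based on the longest common subsequence of characters."""
    m, n = len(cand), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cand[i] == ref[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, rec = lcs / m, lcs / n
    return 2 * p * rec / (p + rec)

if __name__ == "__main__":
    cand = "模型在教育文本摘要任务上表现更好"   # invented candidate summary
    ref = "该模型提升了教育文本摘要的质量"     # invented reference summary
    print("ROUGE-1:", round(ngram_f1(cand, ref, 1), 3))
    print("ROUGE-2:", round(ngram_f1(cand, ref, 2), 3))
    print("ROUGE-L:", round(lcs_f1(cand, ref), 3))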