
Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 68-77. doi: 10.19678/j.issn.1000-3428.0067225

• Artificial Intelligence and Pattern Recognition •

Research on Low-Resource Cross-Lingual Summarization Method Based on Multi-Strategy Reinforcement Learning

Xiongbo FENG1,2,*, Yuxin HUANG1,2, Hua LAI1,2, Yumeng GAO1,2

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, Yunnan, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, Yunnan, China
  • Received: 2023-03-22 Online: 2024-02-15 Published: 2024-02-21
  • Corresponding author: Xiongbo FENG
  • Funding:
    National Natural Science Foundation of China (U21B2027); Yunnan Provincial Major Science and Technology Project (202202AD080003); Yunnan Fundamental Research Projects (202201AT070915, 202201AT070768); Kunming University of Science and Technology "Double First-Class" Joint Special Project (202201BE070001-021)

Abstract:

Cross-Lingual Summarization (CLS) aims to generate a summary in a target language (e.g., Chinese) for a given source-language document (e.g., in Vietnamese). End-to-end CLS models achieve strong performance when trained on large-scale, high-quality labeled data, which is usually constructed by using machine translation models to translate monolingual summarization corpora into CLS corpora. However, because translation models for low-resource languages perform poorly, translation noise is introduced into the CLS corpus, degrading CLS model performance. This paper proposes a low-resource CLS method based on multi-strategy reinforcement learning. Multi-strategy reinforcement learning addresses the problem of training a CLS model on low-resource, noisy training data: the source-language summary is introduced as an additional supervision signal to mitigate the impact of the noisy translated target summary. A reinforcement reward is learned by computing the word correlation and the degree of missing words between the source-language summary and the generated target-language summary, and the CLS model is optimized under the joint constraints of the cross-entropy loss and the reinforcement reward. To verify the performance of the proposed model, a noisy Chinese-Vietnamese CLS corpus is constructed. Experimental results on the Chinese-Vietnamese and Vietnamese-Chinese CLS datasets show that the proposed model achieves significantly better ROUGE scores than the baseline models; compared with the NCLS baseline, it improves ROUGE-1 by 0.71 and 0.84, respectively, effectively weakening noise interference and improving the quality of the generated summaries.

Key words: Chinese-Vietnamese Cross-Lingual Summarization (CLS), low-resource, noisy data, noise analysis, multi-strategy reinforcement learning
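The reward and training objective described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the names `word_reward`, `mixed_loss`, and the weight `lam` are hypothetical, simple token-overlap ratios stand in for the paper's word-correlation and word-missing measures, and cross-lingual token alignment between the source-language summary and the generated summary (e.g., via a bilingual dictionary) is assumed to have been applied beforehand.

```python
def word_reward(source_summary_tokens, generated_tokens):
    """Reward from word correlation minus the degree of missing words.

    Assumes both token lists are already mapped into a shared vocabulary
    (cross-lingual alignment is out of scope for this sketch).
    """
    src = set(source_summary_tokens)
    gen = set(generated_tokens)
    if not src or not gen:
        return 0.0
    overlap = len(src & gen)
    correlation = overlap / len(gen)   # how relevant the generated words are
    missing = 1.0 - overlap / len(src) # fraction of source words not covered
    return correlation - missing       # higher is better


def mixed_loss(ce_loss, reward, lam=0.7):
    # Joint objective: cross-entropy loss mixed with an RL term that
    # shrinks as the reward grows (the REINFORCE-style policy-gradient
    # surrogate is omitted; only the scalar mixing is shown).
    return lam * ce_loss + (1.0 - lam) * (1.0 - reward)
```

For example, a generated summary covering every source-summary word earns the maximum reward of 1.0, which zeroes out the RL term and leaves only the weighted cross-entropy loss.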