
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 332-341. doi: 10.19678/j.issn.1000-3428.0068700

• Development Research and Engineering Application •

  • Corresponding author: HOU Yutao, E-mail: abdklmhldm@163.com
  • Funding: National Natural Science Foundation of China (61966033, 62366050); High-Level Talents Special Project (2022XGC060)

Research on Low-Resource Language Machine Translation for the "Belt and Road"

HOU Yutao, Abudukelimu Abulizi, SHI Yaqing, Mayilamu Musideke, Halidanmu Abudukelimu   

  1. Department of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China
  • Received:2023-10-25 Revised:2024-01-02 Published:2024-04-22


Abstract: With the advancement of the "Belt and Road" initiative, the demand for cross-language communication between countries and regions along the route has grown, and Machine Translation (MT) technology has gradually become an important means of in-depth exchange between countries. However, these countries have many low-resource languages, and the scarcity of corpora has kept machine translation research for them relatively slow. This paper proposes an improved low-resource language machine translation training method based on the NLLB model. An improved training strategy built on a multilingual pre-training model optimizes the loss function on top of data augmentation, thereby effectively improving the translation performance of low-resource languages in machine translation tasks. The ChatGPT and ChatGLM models are then used to evaluate Laotian-Chinese and Vietnamese-Chinese translation, respectively. Large Language Models (LLMs) already possess some ability to translate low-resource languages, and the ChatGPT model significantly outperforms the traditional Neural Machine Translation (NMT) model in Vietnamese-Chinese translation tasks, whereas its performance on Laotian still requires improvement. Experimental results show that, compared with the NLLB-600M baseline model, the proposed method achieves average improvements of 1.33 in BiLingual Evaluation Understudy (BLEU) score and 0.82 in chrF++ score on translation tasks from four low-resource languages into Chinese, fully demonstrating its effectiveness in low-resource language machine translation. In addition, preliminary studies are conducted with the ChatGPT and ChatGLM models on Laotian-Chinese and Vietnamese-Chinese translation, respectively; in the Vietnamese-Chinese task, the ChatGPT model far exceeds traditional NMT models, with improvements of 9.28 in BLEU score and 3.12 in chrF++ score.
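The BLEU and chrF++ scores reported above are corpus-level overlap metrics; chrF++ in particular scores character n-gram (and word n-gram) matches with an F-beta that weights recall more heavily. As a rough illustration of how such a score is computed, the sketch below implements a simplified character-only chrF (the published metric uses character order 6 plus word bigrams via tools such as sacrebleu); the function names and parameter defaults here are illustrative, not the paper's evaluation code.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams with whitespace removed."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_order=3, beta=2.0):
    """Simplified chrF: mean character n-gram F-beta over orders 1..max_order.

    A teaching sketch of the metric family used for evaluation
    (real chrF++ adds word n-grams and uses char order 6).
    """
    f_scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue
        overlap = sum((hyp & ref).values())   # clipped n-gram matches
        precision = overlap / hyp_total
        recall = overlap / ref_total
        if precision + recall == 0:
            f_scores.append(0.0)
        else:
            b2 = beta ** 2                     # beta=2 favours recall
            f_scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return 100.0 * sum(f_scores) / len(f_scores) if f_scores else 0.0

# Identical strings score 100; fully disjoint strings score 0.
print(round(chrf("machine translation", "machine translation"), 1))  # 100.0
```

Because the metric is character-based, it is more forgiving of morphological variation than word-level BLEU, which is one reason it is often reported alongside BLEU for low-resource languages.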

Key words: low-resource languages, Machine Translation (MT), data augmentation, multilingual pre-training models, Large Language Model (LLM)
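The training strategy described in the abstract optimizes the loss function during fine-tuning on augmented data. The abstract does not specify the exact modification, but a standard choice in NMT fine-tuning is label-smoothed cross-entropy, sketched below for a single token position; this is a hypothetical illustration of the technique family, not the authors' implementation.

```python
import math

def label_smoothed_nll(log_probs, target, epsilon=0.1):
    """Label-smoothed negative log-likelihood for one token position.

    log_probs: log-probabilities over the vocabulary (list of floats).
    target:    index of the gold token.
    epsilon:   probability mass spread uniformly over the vocabulary.

    NOTE: a generic NMT training loss shown for illustration; the paper's
    actual loss modification is not specified in the abstract.
    """
    vocab = len(log_probs)
    nll = -log_probs[target]           # standard cross-entropy term
    smooth = -sum(log_probs) / vocab   # uniform-prior regularization term
    return (1.0 - epsilon) * nll + epsilon * smooth

# Toy 4-word vocabulary with most mass on the gold token (index 0).
probs = [0.7, 0.1, 0.1, 0.1]
log_probs = [math.log(p) for p in probs]
loss = label_smoothed_nll(log_probs, target=0, epsilon=0.1)
```

Smoothing discourages the model from becoming over-confident on its limited parallel data, which is especially relevant in low-resource settings where the training corpus covers only a small slice of the target distribution.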
