
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 332-341. doi: 10.19678/j.issn.1000-3428.0068700

• Development Research and Engineering Application •

Research on Low-Resource Language Machine Translation for the "Belt and Road"

Yutao HOU*, Abulizi Abudukelimu, Yaqing SHI, Musideke Mayilamu, Abudukelimu Halidanmu

  1. Department of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China
  • Received: 2023-10-25  Online: 2024-04-15  Published: 2024-04-22
  • Contact: Yutao HOU

  • Funding:
    National Natural Science Foundation of China (61966033); National Natural Science Foundation of China (62366050); Special Project for High-Level Talents (2022XGC060)

Abstract:

With the development of the "Belt and Road" initiative, the demand for cross-language communication among the countries and regions along its routes has grown, and Machine Translation (MT) technology has gradually become an important means of in-depth exchange between countries. However, many of these countries use low-resource languages, and the scarcity of parallel corpora has made progress in MT research for them relatively slow. To address this problem, this paper proposes a low-resource language MT training method based on an improved NLLB model. First, an improved training strategy is built on a multilingual pre-trained model: with data augmentation applied, the loss function is optimized, effectively improving translation performance for low-resource languages. Experimental results show that, compared with the NLLB-600M baseline model, the proposed model achieves average improvements of 1.33 BiLingual Evaluation Understudy (BLEU) points and 0.82 chrF++ points on translation tasks from four low-resource languages into Chinese, demonstrating the effectiveness of the method. In addition, the ChatGPT and ChatGLM models are used to conduct preliminary studies of Laotian-Chinese and Vietnamese-Chinese translation, respectively. The results indicate that Large Language Models (LLM) already possess some ability to translate low-resource languages: in Vietnamese-Chinese translation, the ChatGPT model significantly outperforms traditional Neural Machine Translation (NMT) models, with improvements of 9.28 BLEU points and 3.12 chrF++ points, whereas its performance on Laotian still requires further improvement.
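The BLEU scores discussed above can be illustrated with a minimal, self-contained sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). This is a simplified illustration with our own helper names; published scores such as those in this paper are normally computed with a tokenizer-aware tool like sacreBLEU, which also provides chrF++.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): clipped n-gram precisions for n=1..max_n,
    combined by geometric mean and scaled by the brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ng & r_ng).values())  # clipped by reference counts
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; any corpus with no 4-gram overlap scores 0 in this plain variant (production implementations apply smoothing to avoid that).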

Key words: low-resource languages, Machine Translation (MT), data augmentation, multilingual pre-trained models, Large Language Model (LLM)
