
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 332-341. doi: 10.19678/j.issn.1000-3428.0068700

• Development Research and Engineering Application •

Research on Low-Resource Language Machine Translation for the "Belt and Road"

Yutao HOU*, Abulizi Abudukelimu, Yaqing SHI, Musideke Mayilamu, Abudukelimu Halidanmu

  1. Department of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China
  • Received: 2023-10-25  Online: 2024-04-15  Published: 2024-04-22
  • Contact: Yutao HOU

  • Funding:
    National Natural Science Foundation of China (61966033); National Natural Science Foundation of China (62366050); Special Project for High-Level Talents (2022XGC060)

Abstract:

With the development of the "Belt and Road" initiative, the demand for cross-language communication among the countries and regions along its routes has grown, and Machine Translation (MT) technology has gradually become an important means of in-depth exchange between countries. However, many of these countries use low-resource languages, and the scarcity of parallel corpora has made progress in MT research for them relatively slow. To address this problem, this paper proposes a low-resource language MT training method based on an improved NLLB model. First, an improved training strategy is built on a multilingual pre-trained model: with data augmentation applied, the loss function is optimized, effectively improving translation performance for low-resource languages. Experimental results show that, compared with the NLLB-600M baseline model, the proposed model achieves average improvements of 1.33 BiLingual Evaluation Understudy (BLEU) points and 0.82 chrF++ points on translation tasks from four low-resource languages into Chinese, demonstrating the effectiveness of the method. In addition, the ChatGPT and ChatGLM models are used to conduct preliminary studies of Laotian-Chinese and Vietnamese-Chinese translation, respectively. The results indicate that Large Language Models (LLM) already possess some ability to translate low-resource languages: in Vietnamese-Chinese translation, the ChatGPT model significantly outperforms traditional Neural Machine Translation (NMT) models, with improvements of 9.28 BLEU points and 3.12 chrF++ points, whereas its performance on Laotian still requires further improvement.
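The BLEU scores discussed above can be illustrated with a minimal, self-contained sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). This is a simplified illustration with our own helper names; published scores such as those in this paper are normally computed with a tokenizer-aware tool like sacreBLEU, which also provides chrF++.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): clipped n-gram precisions for n=1..max_n,
    combined by geometric mean and scaled by the brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ng & r_ng).values())  # clipped by reference counts
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect match scores 100; any corpus with no 4-gram overlap scores 0 in this plain variant (production implementations apply smoothing to avoid that).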

Key words: low-resource languages, Machine Translation (MT), data augmentation, multilingual pre-trained models, Large Language Model (LLM)
