Computer Engineering


Research on BERT Incorporation for Chinese-Uyghur Machine Translation

Published: 2020-12-08

Abstract: To address the data sparsity of the Chinese-Uyghur parallel corpus encountered when training Chinese-Uyghur machine translation models, this paper incorporates a Chinese Bidirectional Encoder Representations from Transformers (BERT) model into a Chinese-Uyghur neural machine translation model to improve translation quality. The study compares the effects of incorporating source-language encodings from different pre-trained Chinese BERT models, explores how the representations encoded by different BERT hidden layers affect Chinese-Uyghur neural machine translation, and proposes a two-stage strategy for fine-tuning BERT. Through a series of comparative experiments, it identifies the most effective way to apply the pre-trained language model BERT to Chinese-Uyghur neural machine translation. Experimental results on a public Chinese-Uyghur dataset show that the BLEU score increases by 1.64, demonstrating that the approach effectively improves the performance of the Chinese-Uyghur machine translation system.
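The paper's implementation is not included on this page. The sketch below is a minimal illustration of the two mechanisms the abstract names, assuming PyTorch and the HuggingFace transformers library; the model name bert-base-chinese, the chosen layer index, and the example sentence are illustrative assumptions, not details from the paper. It shows how the representations of a chosen BERT hidden layer can be extracted for a Chinese source sentence, and a freeze/unfreeze switch that realizes a two-stage fine-tuning schedule.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: bert-base-chinese stands in for the pre-trained Chinese
# BERT variants the paper compares.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def encode_source(sentence: str, layer: int = -1) -> torch.Tensor:
    """Return one BERT hidden layer's representations for a source sentence.

    `layer` indexes outputs.hidden_states: index 0 is the embedding layer
    and 1..12 are the Transformer layers of bert-base. The paper compares
    representations from different hidden layers as inputs to the NMT model.
    """
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = bert(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each of
    # shape (batch, seq_len, hidden_size).
    return outputs.hidden_states[layer]

def set_bert_trainable(model: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all BERT parameters.

    Two-stage fine-tuning as described in the abstract: stage 1 keeps BERT
    frozen while the NMT parameters train; stage 2 unfreezes BERT so it is
    fine-tuned jointly with the NMT model.
    """
    for param in model.parameters():
        param.requires_grad = trainable

# Stage 1: train the Chinese-Uyghur NMT model with BERT frozen.
set_bert_trainable(bert, False)
# ... NMT training loop would run here ...

# Stage 2: unfreeze BERT and continue fine-tuning jointly.
set_bert_trainable(bert, True)

# Example: take the 9th Transformer layer's output as source-side features.
features = encode_source("今天天气很好。", layer=9)
print(features.shape)  # torch.Size([1, seq_len, 768])
```

In this setup, swapping the `layer` argument reproduces the abstract's per-layer comparison, and the order of the two `set_bert_trainable` calls realizes the two-stage schedule; how the extracted features are fused into the NMT encoder is not specified here and is left out of the sketch.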