Abstract: This paper describes a phrase-based statistical machine translation system for Chinese-Uyghur and Uyghur-Chinese, built with the open-source Moses toolkit and a Chinese-Uyghur parallel corpus derived from telephone recordings. Because the parallel corpus is small and Uyghur is morphologically rich, the Uyghur words in the word-level corpus are segmented into morphemes to produce a morpheme-level corpus. Experiments are carried out separately at the word level and at the morpheme level. The results show that the morpheme-level system lowers the rate of out-of-vocabulary (OOV) words and improves translation quality.
Key words:
Chinese-Uyghur,
Uyghur-Chinese,
morpheme-level,
preprocessing,
postprocessing
CLC number:
董兴华, 周俊林, 郭树盛, 吐尔洪·吾司曼. 基于短语的汉维/维汉统计机器翻译[J]. 计算机工程, 2011, 37(9): 16-18,21.
DONG Xing-Hua, ZHOU Jun-Lin, GUO Shu-Sheng, TU Er-Hong·Wu-Si-Man. Phrase-based Chinese-Uyghur/Uyghur-Chinese Statistical Machine Translation[J]. Computer Engineering, 2011, 37(9): 16-18,21.