Abstract: Bilingual corpora are a critical resource for developing statistical machine translation systems. This paper presents a method for automatically mining bilingual parallel Web pages from the Web. Unlike traditional approaches that mine parallel pages from pre-specified Web sites, the method mines them from the entire Web, so it adapts well to new language pairs and content domains. A bilingual parallel Web page mining system based on the method is implemented. Experimental results show that the system can provide a large number of high-quality parallel Web pages for statistical machine translation systems.
Key words:
natural language processing,
statistical machine translation,
bilingual corpora,
Web mining
CHEN Wei, HUANG Lei, LIU Feng, ZHAO Zhi-hong. Design and Implementation of Bilingual Parallel Web Page Mining System[J]. Computer Engineering, 2009, 35(14): 267-269.