Computer Engineering ›› 2009, Vol. 35 ›› Issue (14): 267-269. doi: 10.3969/j.issn.1000-3428.2009.14.093

• Development Research and Design Technology •

Design and Implementation of Bilingual Parallel Web Page Mining System

CHEN Wei, HUANG Lei, LIU Feng, ZHAO Zhi-hong   

  1. (Institute of Software, Nanjing University, Nanjing 210089)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-07-20 Published:2009-07-20

Abstract: Bilingual corpora are a critical resource for developing statistical machine translation systems. This paper presents a method for automatically mining bilingual parallel Web pages from the Web. Unlike traditional methods, which mine parallel pages from pre-specified Web sites, this method mines them from the entire Web and adapts well to new language pairs and content domains. A bilingual parallel Web page mining system based on the method is implemented. Experimental results show that the system can provide a large number of high-quality parallel Web pages for statistical machine translation systems.

Key words: natural language processing, statistical machine translation, bilingual corpora, Web mining
