
Computer Engineering ›› 2009, Vol. 35 ›› Issue (14): 267-269. doi: 10.3969/j.issn.1000-3428.2009.14.093

• Developmental Research •

Design and Implementation of Bilingual Parallel Web Page Mining System

CHEN Wei, HUANG Lei, LIU Feng, ZHAO Zhi-hong   

  1. (Institute of Software, Nanjing University, Nanjing 210089)
  • Online: 2009-07-20 Published: 2009-07-20

双语平行网页挖掘系统的设计与实现

陈 伟,黄 蕾,刘 峰,赵志宏   

  1. (南京大学软件学院,南京 210089)

Abstract: Because bilingual corpora are a critical resource for developing statistical machine translation systems, this paper presents a method that automatically mines bilingual parallel Web pages from the Web. Unlike traditional approaches that mine parallel pages from pre-specified Web sites, the method mines them from the entire Web, so it adapts well to new content domains and language pairs. A parallel Web page mining system based on this method is implemented. Experimental results show that the system can provide large-scale, high-quality parallel Web pages for statistical machine translation.

Keywords: natural language processing, statistical machine translation, bilingual corpora, Web mining

摘要: 针对双语语料是开发统计机器翻译系统的重要资源,提出一种从网络中自动挖掘双语平行网页的方法。与传统从指定网站中挖掘平行网页的方法不同,该方法从整个互联网中自动挖掘平行网页,对新的语言对和内容领域有很强的适应能力,实现双语平行网页挖掘的系统。实验结果显示,该系统可以为统计机器翻译系统提供大量高质量的平行网页。

关键词: 自然语言处理, 统计机器翻译, 双语语料, 网络挖掘
