
Computer Engineering ›› 2006, Vol. 32 ›› Issue (16): 49-51. doi: 10.3969/j.issn.1000-3428.2006.16.019

• Software Technology and Database •

Web Information Extraction and Corpus Construction System with C#

LIU Hua   

  1. Department of Applied Linguistics, College of Chinese Language and Culture, Jinan University, Guangzhou 510610
  • Received:1900-01-01 Revised:1900-01-01 Online:2006-08-20 Published:2006-08-20

网页信息抽取及建库系统C#实现

刘 华   

  1. 暨南大学华文学院应用语言学系,广州 510610

Abstract: This paper describes an intelligent, customizable system for Web information extraction and corpus construction, implemented in C# on the Microsoft .NET Framework. The system covers Web page content parsing, data cleaning, information extraction, corpus field definition, and XML storage of the corpus. It is suited to building large training and test corpora for text classification, topic identification, and information retrieval.

Key words: Content parsing, Information extraction, Corpus, XML
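
The stages named in the abstract (content parsing, data cleaning, field definition, XML storage) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `CorpusDocument` type, its field names, the `CorpusPipeline` helpers, and the XML element names are all hypothetical, and regular expressions stand in for whatever HTML parser the actual system uses. It is also written in current C# rather than the 2006-era language.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical record type: these fields are illustrative, not the paper's schema.
public sealed class CorpusDocument
{
    public string Url { get; set; } = "";
    public string Title { get; set; } = "";
    public string Category { get; set; } = "";
    public string Body { get; set; } = "";
}

public static class CorpusPipeline
{
    private const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Content parsing: take the text of the <title> element, if present.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>", Opts);
        return m.Success ? CleanText(m.Groups[1].Value) : "";
    }

    // Content parsing: take the <body> element, falling back to the whole page.
    public static string ExtractBody(string html)
    {
        Match m = Regex.Match(html, @"<body[^>]*>(.*?)</body>", Opts);
        return CleanText(m.Success ? m.Groups[1].Value : html);
    }

    // Data cleaning: drop scripts, styles, comments and tags,
    // decode HTML entities, then collapse runs of whitespace.
    public static string CleanText(string html)
    {
        string s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ", Opts);
        s = Regex.Replace(s, @"<!--.*?-->", " ", RegexOptions.Singleline);
        s = Regex.Replace(s, @"<[^>]+>", " ");
        s = WebUtility.HtmlDecode(s);
        return Regex.Replace(s, @"\s+", " ").Trim();
    }

    // XML storage: one <document> element per page, one child element per field.
    public static XElement ToXml(CorpusDocument doc) =>
        new XElement("document",
            new XAttribute("url", doc.Url),
            new XElement("title", doc.Title),
            new XElement("category", doc.Category),
            new XElement("body", doc.Body));
}

public static class Program
{
    public static void Main()
    {
        // Inline sample page so the sketch runs without network access.
        string html =
            "<html><head><title>Sample &amp; Test</title><style>p{}</style></head>" +
            "<body><script>var x = 1;</script><p>Hello,   <b>corpus</b>!</p></body></html>";

        var doc = new CorpusDocument
        {
            Url = "http://example.com/a.html",   // placeholder URL
            Title = CorpusPipeline.ExtractTitle(html),
            Category = "sample",                 // label supplied by the corpus builder
            Body = CorpusPipeline.ExtractBody(html)
        };

        var corpus = new XElement("corpus", CorpusPipeline.ToXml(doc));
        Console.WriteLine(corpus);
    }
}
```

Keeping each stage as a separate function mirrors the separation the abstract draws between parsing, cleaning, and storage; a production system would likely replace the regular expressions with a proper HTML parser, since regex-based tag stripping is fragile on malformed pages.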

摘要: 围绕网页内容解析、数据清洗、语料库信息字段定义和XML数据存储4个方面,该文介绍了网页信息自动抽取及建库的原理,并使用C#语言在微软.NET Framework下完成了一个网页信息自动抽取及建库系统,该系统具有智能性和个性化的特点,适合构建文本分类、话题识别和信息检索的大型训练(测试)语料集。

关键词: 内容解析, 信息抽取, 语料库, XML
