摘要: 围绕网页内容解析、数据清洗、语料库信息字段定义和XML数据存储4个方面,该文介绍了网页信息自动抽取及建库的原理,并使用C#语言在微软.NET Framework下完成了一个网页信息自动抽取及建库系统,该系统具有智能性和个性化的特点,适合构建文本分类、话题识别和信息检索的大型训练(测试)语料集。
关键词: 内容解析, 信息抽取, 语料库, XML
Abstract: This paper describes the principles of automatic Web information extraction and corpus construction from four aspects: Web page content parsing, data cleaning, definition of corpus information fields, and XML data storage. A system implementing these principles was built in C# on the Microsoft .NET Framework. The system is intelligent and customizable, and is suited to building large training (test) corpora for text classification, topic identification, and information retrieval.
Key words: Content parsing, Information extraction, Corpus, XML
刘 华. 网页信息抽取及建库系统C#实现[J]. 计算机工程, 2006, 32(16): 49-51.
LIU Hua. Web Information Extraction and Corpus Construction System with C#[J]. Computer Engineering, 2006, 32(16): 49-51.