Web Information Extraction and Corpus Construction System with C#


Computer Engineering, 2006, Vol. 32, Issue 16: 49-51. doi: 10.3969/j.issn.1000-3428.2006.16.019

Section: Software Technology and Database


LIU Hua

Department of Applied Linguistics, College of Chinese Language and Culture, Jinan University, Guangzhou 510610

Received: 1900-01-01; Revised: 1900-01-01; Online: 2006-08-20; Published: 2006-08-20

Abstract

This paper presents the principles of automatic Web information extraction and corpus construction, organized around four aspects: Web page content parsing, data cleaning, definition of corpus information fields, and XML data storage. A system implementing these principles was built in C# on the Microsoft .NET Framework. The system is intelligent and customizable, and it is suited to building large training and test corpora for text classification, topic identification, and information retrieval.

Keywords: content parsing, information extraction, corpus, XML
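
The four stages named in the abstract (content parsing, data cleaning, field definition, XML storage) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the paper's system is written in C#, and Python's standard library is used here just for brevity. The names TextExtractor, clean, extract_record, and to_xml, the fields url, category, title, and content, and the XML element names are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Content parsing: collect <title> text and visible text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth > 0:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(data)


def clean(text):
    """Data cleaning: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def extract_record(html, url, category):
    """Field definition: map one page to a flat record (hypothetical fields)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "category": category,
        "title": clean("".join(parser.title_parts)),
        "content": clean(" ".join(parser.body_parts)),
    }


def to_xml(records):
    """XML storage: serialize records as <corpus><document>...</document></corpus>."""
    root = ET.Element("corpus")
    for record in records:
        doc = ET.SubElement(root, "document")
        for field, value in record.items():
            ET.SubElement(doc, field).text = value
    return ET.tostring(root, encoding="unicode")
```

A category label stored with each document is what would make such a corpus usable as training or test data for text classification; a real system would also need page fetching, encoding detection, and removal of navigation and advertising text, none of which is shown here.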

CLC Number: TP311.12

LIU Hua. Web Information Extraction and Corpus Construction System with C#[J]. Computer Engineering, 2006, 32(16): 49-51.



URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2006.16.019

http://www.ecice06.com/EN/Y2006/V32/I16/49

