WWW Merchandise Information Extraction (互联网商品信息抽取技术)

Computer Engineering, 2008, Vol. 34, Issue 5: 274-276. doi: 10.3969/j.issn.1000-3428.2008.05.096

WWW Merchandise Information Extraction

YU Lu-bo1, CHEN Chao2

(1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027; 2. MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Hefei 230026)

Received: 1900-01-01; Revised: 1900-01-01; Online: 2008-03-05; Published: 2008-03-05

Abstract

Abstract: To address the diversity of page formats in Web page information extraction, this paper proposes an information extraction algorithm based on statistical clustering of XPath paths. The algorithm exploits the characteristics of e-commerce Web pages and gives a general mathematical expression for a page's statistical information. On this basis, it applies statistical clustering to segment the page into information blocks and then extracts the information from them. Experiments on pages from real e-commerce websites demonstrate the effectiveness of the algorithm: segmentation accuracy reaches 92.27%, and information extraction accuracy reaches 98.24%.

Key words: Web page segmentation, Web page information extraction, wrapper, XPath clustering
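
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of grouping page content by tag path, not the authors' method. It assumes a simplified path (tag names only, no positional indices), a plain frequency count as the page statistic, and a hypothetical `min_count` threshold for deciding which paths form repeated information blocks. The names `PathCollector` and `cluster_by_path` are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the tag path (a simplified XPath without indices) of each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/" + "/".join(self.stack), text))


def cluster_by_path(html, min_count=2):
    """Group text nodes by tag path; keep paths seen at least min_count times.

    The threshold is an assumption made for this sketch: paths that repeat
    are treated as candidate information blocks (e.g. product records).
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    clusters = defaultdict(list)
    for path, text in collector.records:
        clusters[path].append(text)
    return {p: t for p, t in clusters.items() if len(t) >= min_count}


# A made-up product listing page used only for illustration.
page = """
<html><body>
  <h1>Laptops</h1>
  <ul>
    <li><span>Model A</span><b>4999</b></li>
    <li><span>Model B</span><b>5999</b></li>
    <li><span>Model C</span><b>6999</b></li>
  </ul>
  <p>Contact us</p>
</body></html>
"""

blocks = cluster_by_path(page)
for path, texts in blocks.items():
    print(path, texts)
```

On this sample page, the product names and prices share repeated paths and are kept as two clusters, while the heading and the footer paragraph occur once each and are discarded. A real e-commerce page would need a more robust similarity measure than exact path equality.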

CLC Number: TP391

YU Lu-bo; CHEN Chao. WWW Merchandise Information Extraction[J]. Computer Engineering, 2008, 34(5): 274-276.

http://www.ecice06.com/CN/Y2008/V34/I5/274
