
Computer Engineering ›› 2008, Vol. 34 ›› Issue (5): 274-276. doi: 10.3969/j.issn.1000-3428.2008.05.096

• Development Research and Design Technology •

WWW Merchandise Information Extraction (互联网商品信息抽取技术)

YU Lu-bo (于鲁波)1, CHEN Chao (陈超)2

  (1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027; 2. MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Hefei 230026)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-03-05 Published: 2008-03-05

Abstract: To address the diversity of page formats in Web page information extraction, this paper proposes an information extraction algorithm based on statistical clustering of paths (XPath clustering). The algorithm exploits the characteristics of e-commerce Web pages and gives a general mathematical expression for the statistical information of a page. On this basis, it applies statistical clustering to segment the page into information blocks and then extracts the information from them. Experiments on pages from real e-commerce websites demonstrate the effectiveness of the algorithm: segmentation accuracy reaches 92.27%, and information extraction accuracy reaches 98.24%.

Key words: Web page segmentation, Web page information extraction, wrapper, XPath clustering
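
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea behind path-based clustering: record the tag path of every text node in a page, group the text by path, and keep the paths that repeat, since repeated paths often correspond to fields of a product listing. It is not the authors' method. The names `PathCollector`, `cluster_by_path`, and `min_count`, the repetition threshold, and the sample page are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the tag path (e.g. 'html/body/ul/li/span') of every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.by_path = defaultdict(list)   # tag path -> texts found there

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:                           # ignore whitespace-only nodes
            self.by_path["/".join(self.stack)].append(text)


def cluster_by_path(html, min_count=2):
    """Return {path: texts} for paths that occur at least min_count times."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return {p: t for p, t in collector.by_path.items() if len(t) >= min_count}


# Hypothetical product listing used only to exercise the sketch.
page = """
<html><body>
<h1>Shop</h1>
<ul>
  <li><span>Phone</span><b>$199</b></li>
  <li><span>Laptop</span><b>$899</b></li>
  <li><span>Camera</span><b>$349</b></li>
</ul>
<p>Contact us</p>
</body></html>
"""

for path, texts in cluster_by_path(page).items():
    print(path, texts)
```

On this sample, only the two paths under `ul/li` repeat, so the heading and the contact line are filtered out and the product names and prices are grouped separately. A real extractor would need a richer similarity measure than exact path equality, which is presumably where the statistical expression mentioned in the abstract comes in.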

CLC Number: