Abstract: This paper proposes an automatic method for extracting Product Data Records (PDRs) from the product list pages of e-commerce websites. Based on the characteristics of product records, the method computes a value for each node in the page's DOM tree from node type information such as text, images, and layout. It then groups the nodes by the similarity of their values and filters the groups to find the node set that contains the data records, thereby extracting the data records of the whole page. Experimental results show that the method is effective and efficient.
Key words:
Web information extraction,
data extraction,
information integration,
Product Data Record (PDR)
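The pipeline summarized in the abstract (score each node, group nodes by value similarity, filter the groups) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formula, so the weights in `WEIGHTS`, the tag set `LAYOUT_TAGS`, the relative `tolerance`, the rule that a record must contain both an image and text, and all function names here are assumptions made for illustration only. The sample page is invented and must be well-formed XML for `xml.etree.ElementTree` to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights per node type; the abstract does not specify them.
WEIGHTS = {"text": 1.0, "image": 5.0, "layout": 0.5}
# Hypothetical set of tags treated as layout nodes.
LAYOUT_TAGS = {"div", "li", "td", "tr", "span", "p"}


def node_value(node):
    """Score a subtree from its image nodes, layout nodes and direct text."""
    value = 0.0
    for el in node.iter():
        if el.tag == "img":
            value += WEIGHTS["image"]
        elif el.tag in LAYOUT_TAGS:
            value += WEIGHTS["layout"]
        if el.text and el.text.strip():
            value += WEIGHTS["text"]
    return value


def group_by_similarity(nodes, tolerance=0.2):
    """Greedily group nodes whose values differ by at most a relative tolerance."""
    groups = []
    for node in nodes:
        v = node_value(node)
        for group in groups:
            ref = group["value"]
            if abs(v - ref) <= tolerance * max(v, ref):
                group["nodes"].append(node)
                break
        else:
            groups.append({"value": v, "nodes": [node]})
    return groups


def extract_records(root, min_records=2):
    """Return the largest group of similar siblings that each hold an image and text."""
    best = []
    for parent in root.iter():
        for group in group_by_similarity(list(parent)):
            nodes = group["nodes"]
            # Assumed filter: a product record has an image and some text.
            is_record = all(
                n.find(".//img") is not None and "".join(n.itertext()).strip()
                for n in nodes
            )
            if is_record and len(nodes) >= min_records and len(nodes) > len(best):
                best = nodes
    return best


# Invented sample page: a navigation bar plus a three-item product list.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/help">Help</a></div>
  <ul id="list">
    <li><img src="a.jpg"/><span>Phone A</span><span>$199</span></li>
    <li><img src="b.jpg"/><span>Phone B</span><span>$299</span></li>
    <li><img src="c.jpg"/><span>Phone C</span><span>$399</span></li>
  </ul>
</body></html>
"""

records = extract_records(ET.fromstring(PAGE))
titles = [r.find("span").text for r in records]
print(titles)
```

On this sample the two navigation links also form a group of similar siblings, but the filter step discards them because they contain no image, which is the role the abstract assigns to filtering. Real list pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.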
YANG Zhou, ZHUO Lin, ZHAO Peng-Peng, CUI Zhi-Ming. Automatic Extraction Method for Product Data Records[J]. Computer Engineering (计算机工程), 2010, 36(23): 262-265.