Computer Engineering ›› 2009, Vol. 35 ›› Issue (3): 23-25,2. doi: 10.3969/j.issn.1000-3428.2009.03.009

• Software Technology and Database •

Duplication Deletion Method for Structural Information

LI Lin (李林), LIU Gui-feng (刘桂峰), ZHAO Peng-peng (赵朋朋), CUI Zhi-ming (崔志明)

  1. (Institute of Intelligent Information Processing and Application, Soochow University, Suzhou 215006)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-02-05 Published: 2009-02-05

Abstract: This paper proposes a learning-based duplication deletion method for Web pages that carry structured information. A classifier is defined from a sample set prepared in advance. Using this classifier, the individual attribute fields of the structured information in a page are classified and a distance is computed for each of them. From these, the distances between the whole information object and the already classified sample records are computed, and the object is judged to be a duplicate or not by comparing these distances with a threshold.

Key words: similarity measure, duplication deletion, clustering
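
The pipeline outlined in the abstract (per-field distances, a record-level distance, and a threshold test) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The field names, the weights in `FIELD_WEIGHTS`, the default threshold, and the use of `difflib.SequenceMatcher` as the field distance are all assumptions made for illustration. In particular, the paper derives its classifier from training samples, whereas the weights here are fixed by hand.

```python
from difflib import SequenceMatcher

# Hypothetical per-field weights. The paper learns its classifier from
# training samples; these hand-picked values only stand in for that step.
FIELD_WEIGHTS = {"title": 0.6, "author": 0.3, "year": 0.1}


def field_distance(a: str, b: str) -> float:
    """Distance in [0, 1] between two field values (0 means identical)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(x: dict, y: dict) -> float:
    """Weighted sum of per-field distances between two records."""
    return sum(
        weight * field_distance(x.get(field, ""), y.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )


def is_duplicate(record: dict, samples: list, threshold: float = 0.2) -> bool:
    """Flag the record if any known sample lies within the threshold."""
    return any(record_distance(record, s) <= threshold for s in samples)
```

A new record would be passed to `is_duplicate` together with the list of already classified sample records. Lowering `threshold` makes the test stricter, so fewer records are flagged as duplicates.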
