计算机工程

• 软件技术与数据库 • 上一篇    下一篇

一种基于滑动分块的重复数据检测算法

郑亚光,潘久辉   

  1. (暨南大学信息科学技术学院,广州 510632)
  • 收稿日期:2015-01-23 出版日期:2016-02-15 发布日期:2016-01-29
  • 作者简介:郑亚光(1990-),男,硕士研究生,主研方向为数据清洗、信息集成;潘久辉,教授、博士生导师。
  • 基金项目:
    公安部技术研究计划基金资助项目(2014JSYJB048);武汉大学软件工程国家重点实验室开放基金资助项目(SKLSE2012-09-37)。

A Duplicate Data Detection Algorithm Based on Sliding Blocking

ZHENG Yaguang,PAN Jiuhui   

  1. (College of Information Science and Technology,Jinan University,Guangzhou 510632,China)
  • Received:2015-01-23 Online:2016-02-15 Published:2016-01-29

摘要: 当被插入或删除的字节接近于匹配失败数据段两侧时,会导致SBBS算法回溯功能局部甚至完全失效。为此,提出一种改进的重复数据检测算法。采用滑动与滚动相结合的窗口移动模式减少窗口计算量,利用Rsync滚动校验和算法与MD5算法优化窗口计算模式,加快匹配速度。 通过回溯匹配失败数据段,检测其中的重复数据段,以提升重复数据的检测精度。实验结果表明,与SBBS算法相比,该算法在重复数据段均匀分布与非均匀分布时的查全率分别提高约4.32%和5.28%。

关键词: 重复数据检测, 匹配失败数据段, SBBS算法, 窗口计算, 校验和算法, 回溯

Abstract: The backtracking function of Sliding Blocking Algorithm with Backtracking Sub-block(SBBS) becomes partial or even complete failure when the bytes are inserted or deleted,which are close to the two sides of the matching failure data segments.Aiming at the problem,this paper proposes an improved algorithm for duplicate data detection.This algorithm takes advantage of a new movement pattern which combines with gliding and rolling to reduce the computation overhead of window,and it improves the matching speed by adopting Rsync rolling checksum algorithm and MD5 algorithm.By backtracking matching failure data segments,it detects contained duplicate data segments to improve the detection accuracy of duplicate data.Experimental results show that,compared with SBBS,the recall ratio of this algorithm in uniform distribution and non uniform distribution of duplicate data segments are increased by 4.32% and 5.28% respectively.

Key words: duplicate data detection, matching failure data segment, Sliding Blocking Algorithm with Backtracking Sub-block(SBBS), window calculation, checksum algorithm, backtracking

中图分类号: