摘要: 当被插入或删除的字节接近于匹配失败数据段两侧时,会导致SBBS算法回溯功能局部甚至完全失效。为此,提出一种改进的重复数据检测算法。采用滑动与滚动相结合的窗口移动模式减少窗口计算量,利用Rsync滚动校验和算法与MD5算法优化窗口计算模式,加快匹配速度。
通过回溯匹配失败数据段,检测其中的重复数据段,以提升重复数据的检测精度。实验结果表明,与SBBS算法相比,该算法在重复数据段均匀分布与非均匀分布时的查全率分别提高约4.32%和5.28%。
关键词:
重复数据检测,
匹配失败数据段,
SBBS算法,
窗口计算,
校验和算法,
回溯
Abstract: The backtracking function of Sliding Blocking Algorithm with Backtracking Sub-block(SBBS) becomes partial or even complete failure when the bytes are inserted or deleted,which are close to the two sides of the matching failure data segments.Aiming at the problem,this paper proposes an improved algorithm for duplicate data detection.This algorithm takes advantage of a new movement pattern which combines with gliding and rolling to reduce the computation overhead of window,and it improves the matching speed by adopting Rsync rolling checksum algorithm and MD5 algorithm.By backtracking matching failure data segments,it detects contained duplicate data segments to improve the detection accuracy of duplicate data.Experimental results show that,compared with SBBS,the recall ratio of this algorithm in uniform distribution and non uniform distribution of duplicate data segments are increased by 4.32% and 5.28% respectively.
Key words:
duplicate data detection,
matching failure data segment,
Sliding Blocking Algorithm with Backtracking Sub-block(SBBS),
window calculation,
checksum algorithm,
backtracking
中图分类号: