Computer Engineering

• Advanced Computing and Data Processing

Distributed Backup Data Deduplication System Based on Data Routing

YAO Min, YIN Jianwei, TANG Yan, LUO Zhiling

  1. (College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China)
  • Received: 2016-03-07 Online: 2017-02-15 Published: 2017-02-15
  • About the authors: YAO Min (born 1989), male, master's student, whose main research interests are data deduplication and distributed storage; YIN Jianwei, professor and doctoral supervisor; TANG Yan and LUO Zhiling, doctoral students.
  • Funding:
    National Key Technology R&D Program project "Research, Development and Demonstration Application of a Common Technology System for Cross-Domain Services in the Modern Service Industry" (2013AA01A213).

Abstract: In big data scenarios, traditional deduplication backup systems suffer from excessive backup storage consumption and insufficient data throughput. To address these problems, this paper designs a distributed backup data deduplication system based on data routing. The system deduplicates at the granularity of data chunks and provides two functions: data routing and data prefetching. Data routing uses a Bloom filter to perform routing queries for the chunks to be processed and assigns each chunk to the corresponding processing node. Data prefetching uses two schemes, average sampling and neighbor sampling based on Jaccard distance: the chunk hash codes obtained by average sampling provide routing information for data routing, and those obtained by neighbor sampling are used for the system's first round of deduplication. Experimental results show that, while maintaining the deduplication ratio, the system achieves significantly higher data throughput than chunk routing that queries all processing nodes or uses fixed routing.

Key words: data deduplication, data routing, data prefetching, Bloom filter, Jaccard distance
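
The two building blocks named in the abstract, Bloom-filter routing queries and Jaccard distance between sets of chunk hash codes, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no parameters or data layout, so the one-filter-per-node arrangement, the filter size, the SHA-1 fingerprints, and the names `BloomFilter`, `chunk_hash`, `route`, and `jaccard_distance` are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over byte strings (sizes are illustrative assumptions)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def chunk_hash(chunk):
    """Fingerprint of a data chunk (SHA-1 is an assumption, not from the paper)."""
    return hashlib.sha1(chunk).digest()


def route(fingerprint, node_filters):
    """Return the ids of nodes whose filter may contain the fingerprint.

    node_filters maps a hypothetical node id to that node's BloomFilter.
    An empty result means no node is known to hold the chunk.
    """
    return [nid for nid, bf in node_filters.items() if fingerprint in bf]


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two hash-code collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    node_filters = {0: BloomFilter(), 1: BloomFilter()}
    fp = chunk_hash(b"example chunk")
    node_filters[1].add(fp)  # pretend node 1 already stores this chunk
    print(route(fp, node_filters))
    print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
```

Because a Bloom filter can report false positives, `route` may return extra candidate nodes, so a real system would still need an exact lookup on the chosen node. What the system does when no filter matches is not described in the abstract and is left open here.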
