
Computer Engineering


Optimization Scheme of Small File in Cloud Storage System Based on HDFS

ZOU Zhenyu, ZHENG Quan, WANG Song, YANG Jian

  1. (Department of Automation, University of Science and Technology of China, Hefei 230027, China)
  • Received: 2015-04-14 Online: 2016-03-15 Published: 2016-03-15

基于HDFS的云存储系统小文件优化方案

邹振宇,郑烇,王嵩,杨坚   

  1. (中国科学技术大学自动化系,合肥 230027)
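  • About the authors: ZOU Zhenyu (born 1990), male, master's student, whose main research interests are cloud storage systems and network propagation and control; ZHENG Quan, associate professor, Ph.D.; WANG Song, lecturer, Ph.D.; YANG Jian, associate professor and doctoral supervisor.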
  • Funding:

    Project supported by the National Natural Science Foundation of China (61174062).

Abstract:

Hadoop Distributed File System (HDFS) offers high fault tolerance, scalability, and low-cost storage, and is widely used for big-data storage and analysis. When storing massive numbers of small files, however, HDFS suffers from high memory consumption and high access latency. Targeting the "upload once, download many times" access pattern of the "Hefei City Cloud" system, this paper proposes an optimization scheme based on the attributes of small files. Priorities are assigned according to the correlation between files; files smaller than 5 MB are merged in priority order before upload, and an index record is generated. Drawing on the idea of randomization, the scheme adopts a two-level caching policy that caches prefetched data in a memory pool to improve access efficiency. The system also checks the access log periodically and dynamically adjusts the prefetching factor according to users' access habits. Experimental results show that the scheme effectively improves small-file access efficiency, reduces the memory usage of NameNodes and DataNodes, and significantly improves the interactivity of the system when storing and accessing massive numbers of small files.

Key words: Hadoop Distributed File System (HDFS), small file, prefetching, randomization, dynamic adjustment
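
The merge-then-index step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives only the 5 MB threshold, the priority-ordered merge, and the existence of an index record. The names `merge_small_files`, `read_small_file`, and `IndexRecord`, the in-memory byte strings standing in for files, and the integer priority map (larger value merged first) are all assumptions made for illustration, and no HDFS API is called.

```python
from dataclasses import dataclass

SMALL_FILE_LIMIT = 5 * 1024 * 1024  # 5 MB threshold stated in the abstract


@dataclass
class IndexRecord:
    """Hypothetical index entry locating one small file inside the merged blob."""
    name: str    # original file name
    offset: int  # byte offset of the file inside the merged blob
    length: int  # number of bytes belonging to the file


def merge_small_files(files, priority):
    """Merge files below the threshold into one blob, highest priority first.

    files:    dict mapping file name -> bytes (stand-in for real file contents)
    priority: dict mapping file name -> int (assumed scale; larger merges earlier)
    Returns (merged_bytes, index, large_file_names).
    """
    small = [name for name, data in files.items() if len(data) < SMALL_FILE_LIMIT]
    large = [name for name in files if name not in small]

    # Order by descending priority; break ties by name for a stable layout.
    small.sort(key=lambda name: (-priority.get(name, 0), name))

    merged = bytearray()
    index = {}
    for name in small:
        data = files[name]
        index[name] = IndexRecord(name, len(merged), len(data))
        merged.extend(data)
    return bytes(merged), index, large


def read_small_file(merged, index, name):
    """Recover one small file from the merged blob using its index record."""
    record = index[name]
    return merged[record.offset:record.offset + record.length]
```

In this sketch, files that are likely to be accessed together end up adjacent in the merged blob, so a single read of the blob can serve several related requests; files at or above the threshold are returned separately and would be uploaded unchanged.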

摘要:

Hadoop分布式文件系统(HDFS)具有高容错、可伸缩、廉价存储等优良特性,在大数据存储和分析场景中得到广泛应用。但对于海量小文件存储,HDFS存在高内存消耗、高延迟访问等缺陷。为此,结合 “合肥城市云”系统“一次上传,多次下载”的特性,提出一种基于小文件属性的优化方案。根据文件之间的相关性设定优先级,对小于5 MB的文件按优先级高低合并后再上传,并生成索引记录。结合随机化思想,采用两级缓存策略,将预提取数据缓存在内存池中,提高访问效率。同时,系统定期查询访问日志,根据用户访问习惯,动态调整预提取因子的大小。实验结果表明,该方案能有效提高小文件访问效率,降低名字节点和数据节点的内存开销,在有海量小文件存取的情况下提升系统的交互性。

关键词: Hadoop分布式文件系统, 小文件, 预提取, 随机化, 动态调整
