Optimization Scheme of Small File in Cloud Storage System Based on HDFS

doi:10.3969/j.issn.1000-3428.2016.03.010

Abstract

Hadoop Distributed File System (HDFS) offers high fault tolerance, scalability, and low-cost storage, and is widely used for big data storage and analysis. When storing massive numbers of small files, however, HDFS suffers from high memory consumption and high access latency. Targeting the “upload once, download many times” access pattern of the “Hefei City Cloud” system, this paper proposes an optimization scheme based on the attributes of small files. Priorities are assigned according to the correlation between files. Files smaller than 5 MB are merged in priority order before being uploaded, and an index record is generated for them. Drawing on the idea of randomization, the scheme uses a two-level caching policy that keeps prefetched data in a memory pool to improve access efficiency. In addition, the system periodically checks the access log and dynamically adjusts the prefetching factor according to users' access habits. Experimental results show that the proposed scheme effectively improves small-file access efficiency and reduces the memory usage of NameNodes and DataNodes. It can significantly improve the interactivity of the system when storing and accessing massive numbers of small files.

Key words: Hadoop Distributed File System (HDFS), small file, prefetching, randomization, dynamic adjustment
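
The merge-and-index step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how priorities are computed or what an index record contains, so the integer priorities, the `IndexRecord` fields, the container name `merged-0000`, and the "higher priority first" ordering are all assumptions made for illustration. Only the 5 MB threshold comes from the text.

```python
from dataclasses import dataclass

SMALL_FILE_LIMIT = 5 * 1024 * 1024  # 5 MB threshold stated in the abstract


@dataclass
class IndexRecord:
    """Hypothetical index record; the field layout is an assumption."""
    name: str       # original small-file name
    container: str  # name of the merged file that holds the data
    offset: int     # byte offset of the data inside the merged file
    length: int     # size of the small file in bytes


def merge_small_files(files, container="merged-0000"):
    """Merge files below the threshold into one byte string.

    files: list of (name, priority, data) tuples, where priority is an
    integer (assumed: a larger value means the file is packed earlier).
    Returns (merged_bytes, index_records, names_of_files_left_unmerged).
    """
    small = [f for f in files if len(f[2]) < SMALL_FILE_LIMIT]
    large = [f[0] for f in files if len(f[2]) >= SMALL_FILE_LIMIT]

    # Stable sort: files with equal priority keep their input order.
    small.sort(key=lambda f: -f[1])

    merged = bytearray()
    index = []
    for name, _priority, data in small:
        index.append(IndexRecord(name, container, len(merged), len(data)))
        merged.extend(data)
    return bytes(merged), index, large


def read_small_file(merged, record):
    """Recover one small file from the merged bytes via its index record."""
    return merged[record.offset:record.offset + record.length]
```

In a real deployment the merged bytes would be written to HDFS as a single file, so the NameNode tracks one entry instead of many; the index lookup then replaces the per-file metadata lookup.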


CLC Number: TP391

ZOU Zhenyu, ZHENG Quan, WANG Song, YANG Jian. Optimization Scheme of Small File in Cloud Storage System Based on HDFS [J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2016.03.010. (Original Chinese title: 基于HDFS的云存储系统小文件优化方案; journal: 计算机工程.)


URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2016.03.010

http://www.ecice06.com/EN/Y2016/V42/I3/34


