
计算机工程 (Computer Engineering) ›› 2025, Vol. 51 ›› Issue (7): 190-198. doi: 10.19678/j.issn.1000-3428.0068760

• Advanced Computing and Data Processing •

Spark Adaptive Cache Optimization Strategy Based on the Reuse Degree of RDD

PAN Shunjie, YU Junyang, WANG Longge, LI Han*, ZHAI Rui

  1. School of Software, Henan University, Kaifeng 475004, Henan, China
  • Received: 2023-11-03  Online: 2025-07-15  Published: 2024-06-11
  • Corresponding author: LI Han
  • Supported by:
    Henan Province Science and Technology Research Project (232102210029); Henan Province Science and Technology Research Project (232102210031)


Abstract:

The Spark distributed computing framework performs job computation in memory but does not manage the intermediate results of jobs, so frequently accessed data blocks are easily lost; this is especially evident in iterative jobs. Spark implements its Least Recently Used (LRU) cache through the hash table provided by LinkedHashMap: recently accessed entries are moved to the tail of the access-order list, so the entry that has gone unused the longest remains at the head and is evicted first, forcing that data to be recomputed. The LRU replacement policy can therefore evict hot data that is accessed frequently overall but happens not to be in use at the moment. To address this problem, this paper proposes LCRD, a Spark adaptive cache optimization strategy based on the reuse degree of Resilient Distributed Datasets (RDDs), comprising an automatic caching algorithm and an automatic cache-cleaning algorithm. First, before job execution, the automatic caching algorithm analyzes Spark's Directed Acyclic Graph (DAG), computes data such as each RDD's reuse frequency and operator complexity, quantifies the factors affecting execution efficiency, and scores each RDD with a reuse-degree model; during execution, the application caches the data blocks with higher reuse degrees. Second, when a memory bottleneck occurs or an RDD cache becomes invalid, the automatic cache-cleaning algorithm traverses the cache queue and evicts data blocks with low access frequency. Experimental results on four public datasets (amazon0302, email-EuAll, web-Google, and wiki-Talk) running iterative PageRank jobs show that, compared with LRU, LCRD improves execution efficiency by an average of 10.7%, 8.6%, 17.9%, and 10.6%, respectively, and memory utilization by an average of 3%, 4%, 3%, and 5%, respectively. The proposed strategy effectively improves both the execution efficiency and the memory utilization of Spark.

Key words: parallel computing, Spark framework, cache replacement, Least Recently Used (LRU) algorithm, big data

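The LRU mechanism the abstract criticizes comes from LinkedHashMap's access-order mode: each read moves an entry to the tail of the internal list, and `removeEldestEntry` evicts from the head when capacity is exceeded. The minimal standalone Java sketch below (class and key names are illustrative, not Spark's actual MemoryStore code) reproduces the failure mode described above: a block that is hot overall but not touched most recently gets evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache built on LinkedHashMap's access-order mode. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // accessOrder = true: every get()/put() moves the entry to the tail,
        // so the head of the iteration order is the least recently used entry.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the head (LRU) entry.
        return size() > capacity;
    }
}

public class LruDemo {
    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("rdd1", "partition data 1");
        cache.put("rdd2", "partition data 2");
        cache.get("rdd1");                     // touch rdd1; rdd2 is now the LRU entry
        cache.put("rdd3", "partition data 3"); // capacity exceeded: rdd2 is evicted,
                                               // even if rdd2 is frequently needed later
        System.out.println(cache.containsKey("rdd2")); // prints false
        System.out.println(cache.keySet());            // prints [rdd1, rdd3]
    }
}
```

If the evicted block is needed by a later stage, Spark must recompute it from its lineage, which is the recomputation cost the proposed strategy aims to avoid.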
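The abstract names reuse frequency and operator complexity as inputs to the reuse-degree model but does not give the formula. The sketch below is purely illustrative: it assumes a simple product of the two factors and a fixed caching threshold, which are this editor's assumptions and not the paper's actual model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of reuse-degree-based cache selection: score each RDD
 *  by how often the DAG references it and how costly its operators are to
 *  recompute, then cache the RDDs whose score clears a threshold. */
public class ReuseDegreeSketch {
    // Assumed model: reuse degree = reuse frequency x operator complexity.
    public static double reuseDegree(int reuseFrequency, double operatorComplexity) {
        return reuseFrequency * operatorComplexity;
    }

    public static void main(String[] args) {
        // Toy DAG statistics: rddId -> { times referenced, relative operator cost }.
        Map<String, double[]> stats = new LinkedHashMap<>();
        stats.put("rddA", new double[]{4, 2.0}); // reused often, costly operators
        stats.put("rddB", new double[]{1, 5.0}); // expensive but referenced once
        stats.put("rddC", new double[]{3, 0.5}); // reused but cheap to recompute

        double threshold = 2.0; // assumed caching threshold
        List<String> toCache = new ArrayList<>();
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double degree = reuseDegree((int) e.getValue()[0], e.getValue()[1]);
            if (degree >= threshold) {
                toCache.add(e.getKey()); // candidate for persist() before execution
            }
        }
        System.out.println(toCache); // prints [rddA, rddB]
    }
}
```

Under these assumed weights, a cheap-to-recompute RDD is skipped even when it is reused, matching the abstract's intent that caching decisions weigh recomputation cost, not recency alone.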