
计算机工程 (Computer Engineering) ›› 2025, Vol. 51 ›› Issue (7): 190-198. doi: 10.19678/j.issn.1000-3428.0068760

• Advanced Computing and Data Processing •

Spark Adaptive Cache Optimization Strategy Based on the Reuse Degree of RDD

PAN Shunjie, YU Junyang, WANG Longge, LI Han*, ZHAI Rui

  1. School of Software, Henan University, Kaifeng 475004, Henan, China
  • Received: 2023-11-03  Online: 2025-07-15  Published: 2024-06-11
  • Corresponding author: LI Han
  • Supported by:
    Henan Province Science and Technology Research Project (232102210029); Henan Province Science and Technology Research Project (232102210031)


Abstract:

The Spark distributed computing framework performs job computation in memory but does not manage the intermediate results of jobs, so frequently accessed data blocks are easily lost; this is especially evident in iterative jobs. Spark implements its Least Recently Used (LRU) cache through the hash table provided by LinkedHashMap: recently accessed entries are moved to the tail of the access-order list, so the entry that has gone unused the longest remains at the head and is evicted first, forcing that data to be recomputed. The LRU replacement policy can therefore evict hot data that is accessed frequently overall but happens not to be in use at the moment. To address this problem, this paper proposes LCRD, a Spark adaptive cache optimization strategy based on the reuse degree of Resilient Distributed Datasets (RDDs), comprising an automatic caching algorithm and an automatic cache-cleaning algorithm. First, before job execution, the automatic caching algorithm analyzes Spark's Directed Acyclic Graph (DAG), computes data such as each RDD's reuse frequency and operator complexity, quantifies the factors affecting execution efficiency, and scores each RDD with a reuse-degree model; during execution, the application caches the data blocks with higher reuse degrees. Second, when a memory bottleneck occurs or an RDD cache becomes invalid, the automatic cache-cleaning algorithm traverses the cache queue and evicts data blocks with low access frequency. Experimental results on four public datasets (amazon0302, email-EuAll, web-Google, and wiki-Talk) running iterative PageRank jobs show that, compared with LRU, LCRD improves execution efficiency by an average of 10.7%, 8.6%, 17.9%, and 10.6%, respectively, and memory utilization by an average of 3%, 4%, 3%, and 5%, respectively. The proposed strategy effectively improves both the execution efficiency and the memory utilization of Spark.

Key words: parallel computing, Spark framework, cache replacement, Least Recently Used (LRU) algorithm, big data

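The LRU mechanism the abstract criticizes comes from LinkedHashMap's access-order mode: each read moves an entry to the tail of the internal list, and `removeEldestEntry` evicts from the head when capacity is exceeded. The minimal standalone Java sketch below (class and key names are illustrative, not Spark's actual MemoryStore code) reproduces the failure mode described above: a block that is hot overall but not touched most recently gets evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal LRU cache built on LinkedHashMap's access-order mode. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // accessOrder = true: every get()/put() moves the entry to the tail,
        // so the head of the iteration order is the least recently used entry.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the head (LRU) entry.
        return size() > capacity;
    }
}

public class LruDemo {
    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("rdd1", "partition data 1");
        cache.put("rdd2", "partition data 2");
        cache.get("rdd1");                     // touch rdd1; rdd2 is now the LRU entry
        cache.put("rdd3", "partition data 3"); // capacity exceeded: rdd2 is evicted,
                                               // even if rdd2 is frequently needed later
        System.out.println(cache.containsKey("rdd2")); // prints false
        System.out.println(cache.keySet());            // prints [rdd1, rdd3]
    }
}
```

If the evicted block is needed by a later stage, Spark must recompute it from its lineage, which is the recomputation cost the proposed strategy aims to avoid.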
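The abstract names reuse frequency and operator complexity as inputs to the reuse-degree model but does not give the formula. The sketch below is purely illustrative: it assumes a simple product of the two factors and a fixed caching threshold, which are this editor's assumptions and not the paper's actual model.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of reuse-degree-based cache selection: score each RDD
 *  by how often the DAG references it and how costly its operators are to
 *  recompute, then cache the RDDs whose score clears a threshold. */
public class ReuseDegreeSketch {
    // Assumed model: reuse degree = reuse frequency x operator complexity.
    public static double reuseDegree(int reuseFrequency, double operatorComplexity) {
        return reuseFrequency * operatorComplexity;
    }

    public static void main(String[] args) {
        // Toy DAG statistics: rddId -> { times referenced, relative operator cost }.
        Map<String, double[]> stats = new LinkedHashMap<>();
        stats.put("rddA", new double[]{4, 2.0}); // reused often, costly operators
        stats.put("rddB", new double[]{1, 5.0}); // expensive but referenced once
        stats.put("rddC", new double[]{3, 0.5}); // reused but cheap to recompute

        double threshold = 2.0; // assumed caching threshold
        List<String> toCache = new ArrayList<>();
        for (Map.Entry<String, double[]> e : stats.entrySet()) {
            double degree = reuseDegree((int) e.getValue()[0], e.getValue()[1]);
            if (degree >= threshold) {
                toCache.add(e.getKey()); // candidate for persist() before execution
            }
        }
        System.out.println(toCache); // prints [rddA, rddB]
    }
}
```

Under these assumed weights, a cheap-to-recompute RDD is skipped even when it is reused, matching the abstract's intent that caching decisions weigh recomputation cost, not recency alone.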