[1]葛旭冉,欧洋,王博,等. 大语言模型推理中的存储优化技术综述 [J]. 计算机研究与发展, 2025, 62(3): 545-562.
GE Xuran, OU Yang, WANG Bo, et al. A survey of memory optimization techniques for large language model inference [J]. Journal of Computer Research and Development, 2025, 62(3): 545-562.
[2]Li H, Li Y, Tian A, et al. A Survey on Large Language Model Acceleration based on KV Cache Management[J]. ArXiv, 2024, abs/2412.19442. DOI:10.48550/arXiv.2412.19442.
[3]Zhu X, Li J, Liu Y, et al. A Survey on Model Compression for Large Language Models[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 1556-1577.
[4]Naveed H, Khan A U, Qiu S, et al. A Comprehensive Overview of Large Language Models[J]. ArXiv, 2023, abs/2307.06435. DOI:10.48550/arXiv.2307.06435.
[5]Wan Z, Shen H, Wang X, et al. MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference[J]. ArXiv, 2025, abs/2502.17599. DOI:10.48550/arXiv.2502.17599.
[6]Dong H, Yang X, Zhang Z, et al. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference[C]//Forty-first International Conference on Machine Learning, ICML 2024. Vienna, Austria: OpenReview.net, 2024. DOI:10.48550/arXiv.2402.09398.
[7]Han C, Wang Q, Peng H, et al. LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, 2024: 3991-4008.
[8]Xiao G, Tian Y, Chen B, et al. Efficient Streaming Language Models with Attention Sinks[C]//The Twelfth International Conference on Learning Representations, ICLR 2024. Vienna, Austria: OpenReview.net, 2024. DOI:10.48550/arXiv.2309.17453.
[9]Cai Z, Zhang Y, Gao B, et al. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling[J]. ArXiv, 2024, abs/2406.02069. DOI:10.48550/arXiv.2406.02069.
[10]Zhang Z, Sheng Y, Zhou T, et al. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models[C]//Advances in Neural Information Processing Systems 36 (NeurIPS 2023). 2023: 34661-34710. DOI:10.48550/arXiv.2306.14048.
[11]Yang S, Sheng Y, Gonzalez J E, et al. Post-Training Sparse Attention with Double Sparsity[J]. ArXiv, 2024, abs/2408.07092. DOI:10.48550/arXiv.2408.07092.
[12]Aminabadi R Y, Rajbhandari S, Awan A A, et al. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale[C]//SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. Dallas, TX, USA: IEEE, 2022: 1-15.
[13]Ribar L, Chelombiev I, Hudlass-Galley L, et al. SparQ Attention: Bandwidth-Efficient LLM Inference[C]//Forty-first International Conference on Machine Learning, ICML 2024. Vienna, Austria: OpenReview.net, 2024.
[14]曹义魁,陆忠华,张鉴,等. 面向国产加速器的CFD核心算法并行优化 [J]. 数据与计算发展前沿, 2021, 3(4): 93-103. DOI:10.11871/jfdc.issn.2096-742X.2021.04.008.
CAO Yikui, LU Zhonghua, ZHANG Jian, et al. Parallel optimization of CFD core algorithms based on domestic processor[J]. Frontiers of Data and Computing, 2021, 3(4): 93-103.
[15]赵文龙, 王武. Gadget-2在一个加速卡异构平台上的移植与优化[J]. 数据与计算发展前沿, 2022, 4(5): 108-119.
ZHAO Wenlong, WANG Wu. Porting and Optimizing Gadget-2 on a Heterogeneous Accelerator Platform[J]. Frontiers of Data and Computing, 2022, 4(5): 108-119.
[16]Zheng L, Yin L, Xie Z, et al. SGLang: Efficient Execution of Structured Language Model Programs[C]//Advances in Neural Information Processing Systems 37 (NeurIPS 2024). 2024: 62557-62583.
[17]Chen L, Zhao H, Liu T, et al. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 19-35.
[18]Qin Z, Cao Y, Lin M, et al. CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences[J]. ArXiv, 2025, abs/2503.12491. DOI:10.48550/arXiv.2503.12491.
[19]Williams S, Waterman A, Patterson D. Roofline: an insightful visual performance model for multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65-76.
[20]Agrawal A, Panwar A, Mohan J, et al. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills[J]. ArXiv, 2023, abs/2308.16369. DOI:10.48550/arXiv.2308.16369.
[21]Zhong Y, Liu S, Chen J, et al. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving[C]//18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). Santa Clara, CA: USENIX Association, 2024: 193-210.
[22]Touvron H, Martin L, Stone K, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models[J]. ArXiv, 2023, abs/2307.09288. DOI:10.48550/arXiv.2307.09288.
[23]Grattafiori A, Dubey A, Jauhri A, et al. The Llama 3 Herd of Models[J]. ArXiv, 2024, abs/2407.21783. DOI:10.48550/arXiv.2407.21783.
[24]Dao T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[C]//The Twelfth International Conference on Learning Representations, ICLR 2024. Vienna, Austria: OpenReview.net, 2024.
[25]Dao T, Haziza D, Massa F, et al. Flash-decoding for Long-context Inference[EB/OL]. (2023-10-12)[2025-05-10]. https://crfm.stanford.edu/2023/10/12/flashdecoding.html.
[26]Hong K, Dai G, Xu J, et al. FlashDecoding++: Faster Large Language Model Inference on GPUs[J]. ArXiv, 2023, abs/2311.01282. DOI:10.48550/arXiv.2311.01282.
[27]Kwon W, Li Z, Zhuang S, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention[C]//Proceedings of the 29th Symposium on Operating Systems Principles. New York, NY, USA: Association for Computing Machinery, 2023: 611-626. DOI:10.48550/arXiv.2309.06180.
[28]Lin Y, Tang H, Yang S, et al. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving[J]. ArXiv, 2024, abs/2405.04532. DOI:10.48550/arXiv.2405.04532.