
Computer Engineering

   

Efficient KV Cache Sparsification via Ring Buffer-Based Sliding Window and Hierarchical Sparsity Enhancement

  

  • Published: 2025-10-17


Abstract: In long-context and high-concurrency scenarios, large language models (LLMs) face severe challenges during inference because the memory footprint of the key-value (KV) cache in self-attention mechanisms grows quadratically, leading to excessive GPU memory consumption and limited throughput. Although KV cache sparsification methods have been proposed to address this issue, existing approaches still fall short in memory footprint, the complexity of sliding-window design, and computation-memory access efficiency. This paper proposes DoubleSparse++, a triple-optimization framework that addresses these limitations through three techniques: (1) a ring buffer-based sliding window decouples the KV cache size from the text length and reduces the buffer update complexity from O(L) to O(1); (2) an exponential-decay sparse equilibrium strategy dynamically allocates token sparsity according to the layer index, achieving progressive sparsification across layers; (3) an optimized sparse inference kernel employs operator fusion and asynchronous device-stream pipelines to overlap computation and memory access in long-context inference, which significantly raises computational intensity while reducing the number of memory accesses. Experiments conducted on domestic accelerators with mainstream LLMs (including OPT-6.7B, Vicuna-7B-v1.5, LLaMA-2-7B, LLaMA-3.1-8B, and Qwen-2.5-7B) show that, for 4K-token generation tasks, DoubleSparse++ achieves an average 1.31X inference speedup and reduces the memory footprint to 0.72X of DoubleSparse; in 13K-token scenarios, the memory footprint further drops to 0.56X of the baseline. Comprehensive performance analysis confirms that DoubleSparse++ is an efficient KV cache sparsification method, well suited to LLM long-context inference and streaming deployment.
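To make the first two techniques concrete, the following minimal PyTorch sketch shows how a fixed-capacity ring buffer keeps the KV cache size independent of the generated length with O(1) per-token updates, and how a per-layer keep ratio can decay exponentially with the layer index. The names RingKVCache, layer_keep_ratio, window, base_ratio, and decay are illustrative assumptions for this sketch, not identifiers from the DoubleSparse++ implementation.

```python
import torch

class RingKVCache:
    """Fixed-capacity sliding-window KV cache with O(1) per-token updates (illustrative)."""

    def __init__(self, window: int, num_heads: int, head_dim: int,
                 dtype=torch.float16, device="cpu"):
        self.window = window    # sliding-window capacity, independent of text length
        self.pos = 0            # next write slot (wraps around)
        self.filled = 0         # number of valid entries currently stored
        self.k = torch.zeros(window, num_heads, head_dim, dtype=dtype, device=device)
        self.v = torch.zeros(window, num_heads, head_dim, dtype=dtype, device=device)

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        # Overwrite the oldest slot in place: no shifting or reallocation,
        # so the update cost stays O(1) no matter how long generation runs.
        self.k[self.pos] = k_new
        self.v[self.pos] = v_new
        self.pos = (self.pos + 1) % self.window
        self.filled = min(self.filled + 1, self.window)

    def view(self):
        # Expose only the valid entries to the sparse attention kernel.
        return self.k[:self.filled], self.v[:self.filled]


def layer_keep_ratio(layer_idx: int, base_ratio: float = 0.5,
                     decay: float = 0.9) -> float:
    """Exponential-decay allocation of token sparsity across layers.

    Deeper layers keep a geometrically smaller fraction of cached tokens,
    giving progressive layer-wise sparsification; base_ratio and decay
    are assumed hyperparameters, not values from the paper.
    """
    return base_ratio * (decay ** layer_idx)
```

In a decode loop, one would call cache.append(k_t, v_t) once per newly generated token and, for layer l, run sparse attention over roughly a layer_keep_ratio(l) fraction of the buffered entries; the operator fusion and device-stream overlapping of the third technique are kernel-level optimizations orthogonal to this sketch.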
