
Computer Engineering ›› 2025, Vol. 51 ›› Issue (10): 27-36. doi: 10.19678/j.issn.1000-3428.0070644

• Research Hotspots and Reviews •

Inference Optimization for Large Models Based on Adaptive Tensor Swapping and Recomputation

LIANG Xuning, WANG Siqi, YANG Hailong*, LUAN Zhongzhi, LIU Yi, QIAN Depei

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  • Received: 2024-11-22  Revised: 2025-02-22  Online: 2025-10-15  Published: 2025-04-11
  • Contact: YANG Hailong

  • Funding:
    National Key Research and Development Program of China (2023YFB3001801); National Natural Science Foundation of China (62322201, 62072018, U23B2020); Fundamental Research Funds for the Central Universities (YWF-23-L-1121, JKF-20240198); National Key Laboratory of Complex Software project (SKLSDE-2023ZX-05)

Abstract:

Large Language Models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, their extremely large parameter scale makes the limited capacity of GPU memory a performance bottleneck for inference tasks. To address this issue in the context of LLM inference services, this study proposes AdaptiveLLM, which adaptively selects an offloading strategy between tensor swapping and tensor recomputation based on the characteristics of the inference workload. To characterize the inference workload, AdaptiveLLM builds a black-box Machine Learning (ML) model from an operator-level computational complexity analysis to predict the overhead of tensor recomputation, and predicts the overhead of tensor swapping through a fine-grained analysis of KV Cache memory usage. For the adaptive selection of offloading strategies, AdaptiveLLM designs a cost-aware memory optimization strategy for the preemption scheduling phase: when GPU memory is insufficient, it chooses the offloading approach with the lower overhead. For the initiation scheduling phase, it devises a fairness-based user-request scheduling strategy: when GPU memory is available, it schedules more user requests according to the principle of fairness. Experimental results show that, compared with widely used baseline LLM inference frameworks, AdaptiveLLM improves overall throughput while reducing the average weighted turnaround time, thereby achieving fair scheduling.
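
The cost-aware choice between swapping and recomputation described in the abstract can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical illustration and not the paper's implementation: `swap_cost_ms` approximates swap overhead from the KV Cache footprint and an assumed PCIe bandwidth, `recompute_cost_ms` stands in for the operator-level black-box ML predictor with a simple linear estimate, and `admit_order` mimics a fairness-oriented admission order. All identifiers, constants, and formulas below are assumptions made for illustration only.

```python
# Minimal sketch (assumptions, not AdaptiveLLM's actual code): choose between
# tensor swapping and tensor recomputation per preempted request, and order
# waiting requests for fairness-oriented admission.

from dataclasses import dataclass

PCIE_BANDWIDTH_GBPS = 16.0  # assumed effective GPU<->CPU transfer bandwidth (GB/s)

@dataclass
class Request:
    req_id: str
    arrival_time: float                 # seconds since the service started
    num_tokens: int                     # tokens whose KV Cache currently lives on GPU
    kv_bytes_per_token: int             # bytes of KV Cache per token (model-dependent)
    est_recompute_ms_per_token: float   # stand-in for the offline-trained operator-level model

    def kv_cache_bytes(self) -> int:
        # fine-grained KV Cache footprint: tokens * bytes-per-token
        return self.num_tokens * self.kv_bytes_per_token

def swap_cost_ms(req: Request) -> float:
    # swapping overhead ~ KV Cache size / PCIe bandwidth (one direction shown)
    gigabytes = req.kv_cache_bytes() / 1e9
    return gigabytes / PCIE_BANDWIDTH_GBPS * 1e3

def recompute_cost_ms(req: Request) -> float:
    # recomputation overhead predicted from request length; the paper uses a
    # black-box ML model over operator-level complexity features, here replaced
    # by a simple linear estimate
    return req.num_tokens * req.est_recompute_ms_per_token

def choose_offload(req: Request) -> str:
    """Cost-aware choice made when GPU memory is insufficient (preemption phase)."""
    return "swap" if swap_cost_ms(req) <= recompute_cost_ms(req) else "recompute"

def admit_order(waiting: list[Request], now: float) -> list[Request]:
    # fairness-oriented admission when memory is free: prefer requests with the
    # largest normalized waiting time (a rough proxy for weighted turnaround time)
    return sorted(
        waiting,
        key=lambda r: (now - r.arrival_time) / max(r.num_tokens, 1),
        reverse=True,
    )

if __name__ == "__main__":
    r = Request("req-0", arrival_time=0.0, num_tokens=4096,
                kv_bytes_per_token=262144, est_recompute_ms_per_token=0.05)
    print(choose_offload(r),
          f"swap={swap_cost_ms(r):.1f} ms",
          f"recompute={recompute_cost_ms(r):.1f} ms")
```

In a real serving engine the decision would also account for bidirectional transfers, overlap of copies with computation, and memory fragmentation; the sketch only conveys the shape of the cost comparison and of the fairness-based ordering.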

Key words: Large Language Models (LLM), inference, tensor swapping, tensor recomputation, throughput, fairness
