[1] ZHAI J, LI Y H, LI B B, et al. Personalized experiment report comments auto-generation and application based on large language models. Computer Engineering, 2024, 50(7): 42-52. doi: 10.19678/j.issn.1000-3428.0069593. (in Chinese)
[2] CHANG Y P, WANG X, WANG J D, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2024, 15(3): 1-45.
[3] XIE T, KUANG Y Y, TANG Y, et al. Using LLM-supported lecture summarization system to improve knowledge recall and student satisfaction. Expert Systems with Applications, 2025, 269: 126371. doi: 10.1016/j.eswa.2024.126371
[4] DING J H, NGUYEN H, CHEN H H. Evaluation of question-answering based text summarization using LLM invited paper[C]//Proceedings of the IEEE International Conference on Artificial Intelligence Testing (AITest). Washington D.C., USA: IEEE Press, 2024: 142-149.
[5]
[6] GAO D H, CHEN K D, CHEN B, et al. LLMs-based machine translation for E-commerce. Expert Systems with Applications, 2024, 258: 125087. doi: 10.1016/j.eswa.2024.125087
[7] LIU J S, WEN Y. Auto-generation and auto-tuning framework of stencil operation code. Computer Engineering, 2024, 50(6): 35-47. doi: 10.19678/j.issn.1000-3428.0068234. (in Chinese)
[8] MU F W, SHI L, WANG S, et al. ClarifyGPT: a framework for enhancing LLM-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering, 2024, 1: 2332-2354. doi: 10.1145/3660810
[9] CHENG T T, YAO C L, YU X Q, et al. Empathetic dialogue generation by incorporating commonsense knowledge based on multi-head attention mechanism. Computer Engineering, 2024, 50(6): 94-101. doi: 10.19678/j.issn.1000-3428.0068404. (in Chinese)
[10] ZHUANG Y C, YU Y, WANG K, et al. ToolQA: a dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems, 2023, 36: 50117-50143.
[11] OH H, KIM K, KIM J, et al. ExeGPT: constraint-aware resource scheduling for LLM inference[C]//Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2024: 369-384.
[12]
[13]
[14] SHENG Y, ZHENG L M, YUAN B H, et al. FlexGen: high-throughput generative inference of large language models with a single GPU[EB/OL]. [2024-10-05]. https://arxiv.org/abs/2303.06865.
[15] LIAO J J, LI M Z, YANG H L, et al. Exploiting input tensor dynamics in activation checkpointing for efficient training on GPU[C]//Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2023: 156-166.
[16]
[17] PENG X, SHI X H, DAI H L, et al. Capuchin: tensor-based GPU memory management for deep learning[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 891-905.
[18] SUN Z B, CAO H Q, WANG Y W, et al. AdaPipe: optimizing pipeline parallelism with adaptive recomputation and partitioning[C]//Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2024: 86-100.
[19]
[20]
[21] DALE R. GPT-3: what's it good for? Natural Language Engineering, 2021, 27(1): 113-118. doi: 10.1017/S1351324920000601
[22]
[23] KIM H, YU Y, JIANG L W, et al. ProsocialDialog: a prosocial backbone for conversational agents[C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. [S.l.]: ACL, 2022: 4005-4029.
[24] WANG Y Z, KORDI Y, MISHRA S, et al. Self-Instruct: aligning language models with self-generated instructions[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. [S.l.]: ACL, 2023: 13484-13508.
[25] WANG S Q, YANG H L, WANG X Z, et al. Minions: accelerating large language model inference with aggregated speculative execution[EB/OL]. [2024-10-05]. https://arxiv.org/abs/2402.15678v2.
[26] KWON W, LI Z H, ZHUANG S Y, et al. Efficient memory management for large language model serving with PagedAttention[C]//Proceedings of the 29th Symposium on Operating Systems Principles. New York, USA: ACM Press, 2023: 611-626.
[27] HOLMES C, TANAKA M, WYATT M, et al. DeepSpeed-FastGen: high-throughput text generation for LLMs via MII and DeepSpeed-Inference[EB/OL]. [2024-10-05]. https://arxiv.org/abs/2401.08671v1.