
Computer Engineering ›› 2025, Vol. 51 ›› Issue (10): 27-36. doi: 10.19678/j.issn.1000-3428.0070644

• Research Hotspots and Reviews •

Inference Optimization for Large Models Based on Adaptive Tensor Swapping and Recomputation

LIANG Xuning, WANG Siqi, YANG Hailong*, LUAN Zhongzhi, LIU Yi, QIAN Depei

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  • Received: 2024-11-22  Revised: 2025-02-22  Online: 2025-10-15  Published: 2025-04-11
  • Contact: YANG Hailong

  • Funding:
    National Key Research and Development Program of China (2023YFB3001801); National Natural Science Foundation of China (62322201, 62072018, U23B2020); Fundamental Research Funds for the Central Universities (YWF-23-L-1121, JKF-20240198); National Key Laboratory of Complex Software project (SKLSDE-2023ZX-05)

Abstract:

Large Language Models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, their extremely large parameter scale makes the limited capacity of GPU memory a performance bottleneck for inference tasks. To address this issue in the context of LLM inference services, this study proposes AdaptiveLLM, which adaptively selects an offloading strategy between tensor swapping and tensor recomputation based on the characteristics of the inference workload. To characterize the inference workload, AdaptiveLLM builds a black-box Machine Learning (ML) model from an operator-level computational complexity analysis to predict the overhead of tensor recomputation, and predicts the overhead of tensor swapping through a fine-grained analysis of KV Cache memory usage. For the adaptive selection of offloading strategies, AdaptiveLLM designs a cost-aware memory optimization strategy for the preemption scheduling phase: when GPU memory is insufficient, it chooses the offloading approach with the lower overhead. For the initiation scheduling phase, it devises a fairness-based user-request scheduling strategy: when GPU memory is available, it schedules more user requests according to the principle of fairness. Experimental results show that, compared with widely used baseline LLM inference frameworks, AdaptiveLLM improves overall throughput while reducing the average weighted turnaround time, thereby achieving fair scheduling.
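
The cost-aware choice between swapping and recomputation described in the abstract can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical illustration and not the paper's implementation: `swap_cost_ms` approximates swap overhead from the KV Cache footprint and an assumed PCIe bandwidth, `recompute_cost_ms` stands in for the operator-level black-box ML predictor with a simple linear estimate, and `admit_order` mimics a fairness-oriented admission order. All identifiers, constants, and formulas below are assumptions made for illustration only.

```python
# Minimal sketch (assumptions, not AdaptiveLLM's actual code): choose between
# tensor swapping and tensor recomputation per preempted request, and order
# waiting requests for fairness-oriented admission.

from dataclasses import dataclass

PCIE_BANDWIDTH_GBPS = 16.0  # assumed effective GPU<->CPU transfer bandwidth (GB/s)

@dataclass
class Request:
    req_id: str
    arrival_time: float                 # seconds since the service started
    num_tokens: int                     # tokens whose KV Cache currently lives on GPU
    kv_bytes_per_token: int             # bytes of KV Cache per token (model-dependent)
    est_recompute_ms_per_token: float   # stand-in for the offline-trained operator-level model

    def kv_cache_bytes(self) -> int:
        # fine-grained KV Cache footprint: tokens * bytes-per-token
        return self.num_tokens * self.kv_bytes_per_token

def swap_cost_ms(req: Request) -> float:
    # swapping overhead ~ KV Cache size / PCIe bandwidth (one direction shown)
    gigabytes = req.kv_cache_bytes() / 1e9
    return gigabytes / PCIE_BANDWIDTH_GBPS * 1e3

def recompute_cost_ms(req: Request) -> float:
    # recomputation overhead predicted from request length; the paper uses a
    # black-box ML model over operator-level complexity features, here replaced
    # by a simple linear estimate
    return req.num_tokens * req.est_recompute_ms_per_token

def choose_offload(req: Request) -> str:
    """Cost-aware choice made when GPU memory is insufficient (preemption phase)."""
    return "swap" if swap_cost_ms(req) <= recompute_cost_ms(req) else "recompute"

def admit_order(waiting: list[Request], now: float) -> list[Request]:
    # fairness-oriented admission when memory is free: prefer requests with the
    # largest normalized waiting time (a rough proxy for weighted turnaround time)
    return sorted(
        waiting,
        key=lambda r: (now - r.arrival_time) / max(r.num_tokens, 1),
        reverse=True,
    )

if __name__ == "__main__":
    r = Request("req-0", arrival_time=0.0, num_tokens=4096,
                kv_bytes_per_token=262144, est_recompute_ms_per_token=0.05)
    print(choose_offload(r),
          f"swap={swap_cost_ms(r):.1f} ms",
          f"recompute={recompute_cost_ms(r):.1f} ms")
```

In a real serving engine the decision would also account for bidirectional transfers, overlap of copies with computation, and memory fragmentation; the sketch only conveys the shape of the cost comparison and of the fairness-based ordering.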

Key words: Large Language Models (LLM), inference, tensor swapping, tensor recomputation, throughput, fairness
