
Computer Engineering (计算机工程)

Optimization Scheduling Method for Online Inference of Multimodal Large Models

  • Published: 2026-04-15

Abstract: In recent years, Multimodal Large Language Models (MLLMs) have advanced rapidly, and deploying efficient inference services for them has become a serious challenge. Existing online inference scheduling strategies, such as continuous batching and stall-free scheduling, are designed primarily for text-only large language models and typically merge the encoding and prefill stages of a request into a single scheduling unit. Multimodal inputs, however, incur significantly longer and more variable processing times in the encoding stage; applying such coarse-grained scheduling therefore leads to idle computational resources and increased inference latency, which severely constrains the effective throughput of the overall system. To address this issue, this study proposes STEP (Stage-based Time Estimation Priority Scheduling), an online inference scheduling strategy aimed at improving the effective throughput of MLLM serving systems. The key innovation of STEP is fine-grained stage decoupling and scheduling of the inference process. First, the multimodal inference pipeline is decomposed into three independently schedulable stages: encoding, prefill, and decoding. Second, STEP builds a lightweight execution-time prediction model from historical profiling data to accurately estimate batch execution time under TPOT (Time Per Output Token) requirements. Finally, a latency-aware priority scheduling mechanism is introduced to accommodate the diverse TTFT (Time To First Token) requirements of different requests. Experiments on five open-source multimodal datasets covering tasks such as visual question answering and image understanding compare STEP against several baseline methods. The results show that, through fine-grained stage-aware scheduling and execution-time prediction, STEP effectively adapts to the inference characteristics of MLLMs and significantly improves the effective throughput of online inference systems.
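
The abstract names three mechanisms: decoupling encode/prefill/decode into separately schedulable stages, a lightweight execution-time predictor fitted on historical profiles, and TTFT-aware priority batching under a TPOT budget. The sketch below is a minimal illustration of how such pieces could fit together under the assumptions of a linear per-stage time model and slack-based priorities; it is not the authors' implementation, and every name in it (Stage, Request, StageTimeEstimator, ttft_slack, form_batch) is hypothetical.

```python
# Minimal sketch of the ideas summarized in the abstract -- NOT the paper's
# implementation. Assumptions: a linear per-stage cost model and slack-based
# priorities; all names below are illustrative.
import heapq
import time
from dataclasses import dataclass, field
from enum import Enum, auto

import numpy as np


class Stage(Enum):
    ENCODE = auto()   # multimodal (e.g., vision) encoder forward pass
    PREFILL = auto()  # prompt prefill of the language backbone
    DECODE = auto()   # autoregressive token generation


@dataclass(order=True)
class Request:
    priority: float                    # smaller value = more urgent (TTFT slack)
    req_id: int = field(compare=False)
    stage: Stage = field(compare=False, default=Stage.ENCODE)
    arrival: float = field(compare=False, default_factory=time.monotonic)
    ttft_slo: float = field(compare=False, default=1.0)      # seconds
    num_image_tokens: int = field(compare=False, default=0)
    num_prompt_tokens: int = field(compare=False, default=0)


class StageTimeEstimator:
    """Per-stage linear model t ~ w0 + w1 * batch_tokens, fitted by least
    squares on historical (batch_tokens, measured_seconds) profiles."""

    def __init__(self):
        self.weights = {}  # Stage -> np.ndarray([w0, w1])

    def fit(self, stage, batch_tokens, measured_seconds):
        x = np.asarray(batch_tokens, dtype=float)
        y = np.asarray(measured_seconds, dtype=float)
        A = np.stack([np.ones_like(x), x], axis=1)
        self.weights[stage], *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(self, stage, batch_tokens):
        w0, w1 = self.weights[stage]
        return float(w0 + w1 * batch_tokens)


def ttft_slack(req, now):
    """Remaining time before the request would miss its TTFT target."""
    return (req.arrival + req.ttft_slo) - now


def form_batch(queue, stage, estimator, tpot_budget):
    """Pop requests in slack order and admit them while the predicted batch
    execution time stays within the per-step TPOT budget."""
    batch, tokens = [], 0
    while queue:
        req = heapq.heappop(queue)
        if stage is Stage.ENCODE:
            cost = req.num_image_tokens
        elif stage is Stage.PREFILL:
            cost = req.num_prompt_tokens
        else:                          # DECODE: one new token per request per step
            cost = 1
        # Always admit at least one request; otherwise stop before exceeding
        # the TPOT budget and push the request back for the next step.
        if batch and estimator.predict(stage, tokens + cost) > tpot_budget:
            heapq.heappush(queue, req)
            break
        batch.append(req)
        tokens += cost
    return batch
```

A toy usage of the sketch: fit the encoder-time model from a few historical profiles, then batch image-encoding requests by TTFT slack under a 50 ms per-step budget.

```python
est = StageTimeEstimator()
est.fit(Stage.ENCODE, batch_tokens=[256, 512, 1024],
        measured_seconds=[0.020, 0.035, 0.070])

now = time.monotonic()
queue = []
for i in range(4):
    r = Request(priority=0.0, req_id=i, num_image_tokens=576,
                ttft_slo=0.5 + 0.1 * i)
    r.priority = ttft_slack(r, now)   # tighter TTFT target -> scheduled first
    heapq.heappush(queue, r)

print([r.req_id for r in form_batch(queue, Stage.ENCODE, est, tpot_budget=0.05)])
```

In a real serving system the cost model would use richer features (image resolution, KV-cache size, prefill chunk length) and be refreshed online, but the overall structure, per-stage queues, a learned execution-time model, and slack-ordered admission under a latency budget, follows the decomposition described in the abstract.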