
Computer Engineering (计算机工程)

Optimization Scheduling Method for Online Inference of Multimodal Large Models

  • Published: 2026-04-15

Abstract: In recent years, Multimodal Large Language Models (MLLMs) have advanced rapidly, and deploying efficient inference services for them has become a serious challenge. Existing online inference scheduling strategies, such as continuous batching and stall-free scheduling, are designed primarily for text-only large language models and typically merge the encoding and prefill stages of a request into a single scheduling unit. Multimodal inputs, however, incur significantly longer and more variable processing times in the encoding stage; applying such coarse-grained scheduling therefore leads to idle computational resources and increased inference latency, which severely constrains the effective throughput of the overall system. To address this issue, this study proposes STEP (Stage-based Time Estimation Priority Scheduling), an online inference scheduling strategy aimed at improving the effective throughput of MLLM serving systems. The key innovation of STEP is fine-grained stage decoupling and scheduling of the inference process. First, the multimodal inference pipeline is decomposed into three independently schedulable stages: encoding, prefill, and decoding. Second, STEP builds a lightweight execution-time prediction model from historical profiling data to accurately estimate batch execution time under TPOT (Time Per Output Token) requirements. Finally, a latency-aware priority scheduling mechanism is introduced to accommodate the diverse TTFT (Time To First Token) requirements of different requests. Experiments on five open-source multimodal datasets covering tasks such as visual question answering and image understanding compare STEP against several baseline methods. The results show that, through fine-grained stage-aware scheduling and execution-time prediction, STEP effectively adapts to the inference characteristics of MLLMs and significantly improves the effective throughput of online inference systems.
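
The abstract names three mechanisms: decoupling encode/prefill/decode into separately schedulable stages, a lightweight execution-time predictor fitted on historical profiles, and TTFT-aware priority batching under a TPOT budget. The sketch below is a minimal illustration of how such pieces could fit together under the assumptions of a linear per-stage time model and slack-based priorities; it is not the authors' implementation, and every name in it (Stage, Request, StageTimeEstimator, ttft_slack, form_batch) is hypothetical.

```python
# Minimal sketch of the ideas summarized in the abstract -- NOT the paper's
# implementation. Assumptions: a linear per-stage cost model and slack-based
# priorities; all names below are illustrative.
import heapq
import time
from dataclasses import dataclass, field
from enum import Enum, auto

import numpy as np


class Stage(Enum):
    ENCODE = auto()   # multimodal (e.g., vision) encoder forward pass
    PREFILL = auto()  # prompt prefill of the language backbone
    DECODE = auto()   # autoregressive token generation


@dataclass(order=True)
class Request:
    priority: float                    # smaller value = more urgent (TTFT slack)
    req_id: int = field(compare=False)
    stage: Stage = field(compare=False, default=Stage.ENCODE)
    arrival: float = field(compare=False, default_factory=time.monotonic)
    ttft_slo: float = field(compare=False, default=1.0)      # seconds
    num_image_tokens: int = field(compare=False, default=0)
    num_prompt_tokens: int = field(compare=False, default=0)


class StageTimeEstimator:
    """Per-stage linear model t ~ w0 + w1 * batch_tokens, fitted by least
    squares on historical (batch_tokens, measured_seconds) profiles."""

    def __init__(self):
        self.weights = {}  # Stage -> np.ndarray([w0, w1])

    def fit(self, stage, batch_tokens, measured_seconds):
        x = np.asarray(batch_tokens, dtype=float)
        y = np.asarray(measured_seconds, dtype=float)
        A = np.stack([np.ones_like(x), x], axis=1)
        self.weights[stage], *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(self, stage, batch_tokens):
        w0, w1 = self.weights[stage]
        return float(w0 + w1 * batch_tokens)


def ttft_slack(req, now):
    """Remaining time before the request would miss its TTFT target."""
    return (req.arrival + req.ttft_slo) - now


def form_batch(queue, stage, estimator, tpot_budget):
    """Pop requests in slack order and admit them while the predicted batch
    execution time stays within the per-step TPOT budget."""
    batch, tokens = [], 0
    while queue:
        req = heapq.heappop(queue)
        if stage is Stage.ENCODE:
            cost = req.num_image_tokens
        elif stage is Stage.PREFILL:
            cost = req.num_prompt_tokens
        else:                          # DECODE: one new token per request per step
            cost = 1
        # Always admit at least one request; otherwise stop before exceeding
        # the TPOT budget and push the request back for the next step.
        if batch and estimator.predict(stage, tokens + cost) > tpot_budget:
            heapq.heappush(queue, req)
            break
        batch.append(req)
        tokens += cost
    return batch
```

A toy usage of the sketch: fit the encoder-time model from a few historical profiles, then batch image-encoding requests by TTFT slack under a 50 ms per-step budget.

```python
est = StageTimeEstimator()
est.fit(Stage.ENCODE, batch_tokens=[256, 512, 1024],
        measured_seconds=[0.020, 0.035, 0.070])

now = time.monotonic()
queue = []
for i in range(4):
    r = Request(priority=0.0, req_id=i, num_image_tokens=576,
                ttft_slo=0.5 + 0.1 * i)
    r.priority = ttft_slack(r, now)   # tighter TTFT target -> scheduled first
    heapq.heappush(queue, r)

print([r.req_id for r in form_batch(queue, Stage.ENCODE, est, tpot_budget=0.05)])
```

In a real serving system the cost model would use richer features (image resolution, KV-cache size, prefill chunk length) and be refreshed online, but the overall structure, per-stage queues, a learned execution-time model, and slack-ordered admission under a latency budget, follows the decomposition described in the abstract.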