[1] Wang P, Bai S, Tan S, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution[J]. arXiv preprint arXiv:2409.12191, 2024.
[2] Liu H, Li C, Wu Q, et al. Visual instruction tuning[J]. Advances in neural information processing systems, 2023, 36: 34892-34916.
[3] Team G, Anil R, Borgeaud S, et al. Gemini: a family of highly capable multimodal models[J]. arXiv preprint arXiv:2312.11805, 2023.
[4] Gong T, Lyu C, Zhang S, et al. Multimodal-GPT: A vision and language model for dialogue with humans[J]. arXiv preprint arXiv:2305.04790, 2023.
[5] Lee M Y. Building multimodal ai chatbots[J]. arXiv preprint arXiv:2305.03512, 2023.
[6] Chi X, Zhang R, Jiang Z, et al. M$^{2}$Chat: Empowering VLM for multimodal LLM interleaved text-image generation[J]. arXiv preprint arXiv:2311.17963, 2023.
[7] Wu S, Fei H, Qu L, et al. NExT-GPT: Any-to-any multimodal LLM[C]//Forty-first International Conference on Machine Learning. 2024.
[8] 林荣鑫,李硕豪,董力铭,等.基于视觉语言大模型的多模态虚假新闻检测[J/OL].计算机工程: 1-13 [2026-01-02]. https://doi.org/10.19678/j.issn.1000-3428.0252354.
Lin Rongxin, Li Shuohao, Dong Liming, et al. Multimodal Fake News Detection Based on Vision-Language Large Models[J/OL]. Computer Engineering: 1-13 [2026-01-02]. https://doi.org/10.19678/j.issn.1000-3428.0252354.
[9] Zheng L, Yin L, Xie Z, et al. SGLang: Efficient execution of structured language model programs[J]. Advances in neural information processing systems, 2024, 37: 62557-62583.
[10] Kwon W, Li Z, Zhuang S, et al. Efficient memory management for large language model serving with pagedattention[C]//Proceedings of the 29th symposium on operating systems principles. 2023: 611-626.
[11] Hugging Face. Text Generation Inference[EB/OL]. [2025-05-04]. https://github.com/huggingface/text-generation-inference.
[12] Agrawal A, Kedia N, Panwar A, et al. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve[C]//18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 2024: 117-134.
[13] Agrawal A, Panwar A, Mohan J, et al. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills[J]. arXiv preprint arXiv:2308.16369, 2023.
[14] Zhong Y, Liu S, Chen J, et al. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving[C]//18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 2024: 193-210.
[15] Singh G, Wang X, Hu Y, et al. Efficiently serving large multimodal models using EPD disaggregation[J]. arXiv preprint arXiv:2501.05460, 2025.
[16] Qiu H, Biswas A, Zhao Z, et al. ModServe: Scalable and resource-efficient large multimodal model serving[J]. arXiv preprint arXiv:2502.00937, 2025.
[17] Ning Z, Zhao J, Jin Q, et al. Inf-MLLM: Efficient streaming inference of multimodal large language models on a single GPU[J]. arXiv preprint arXiv:2409.09086, 2024.
[18] Liu Z, Liu B, Wang J, et al. Efficient inference of vision instruction-following models with elastic cache[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 54-69.
[19] Yu G I, Jeong J S, Kim G W, et al. Orca: A distributed serving system for Transformer-based generative models[C]//16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 2022: 521-538.
[20] Lyu H, Liu B, Wu M, et al. FairBatching: Fairness-Aware Batch Formation for LLM Inference[J]. arXiv preprint arXiv:2510.14392, 2025.
[21] Hu C, Huang H, Hu J, et al. MemServe: Context caching for disaggregated LLM serving with elastic memory pool[J]. arXiv preprint arXiv:2406.17565, 2024.
[22] Qin R, Li Z, He W, et al. Mooncake: A KVCache-centric disaggregated architecture for LLM serving[J]. arXiv preprint arXiv:2407.00079, 2024.
[23] 梁绪宁,王思琪,杨海龙,等.基于自适应张量交换和重算的大模型推理优化[J].计算机工程,2025,51(10):27-36.
Liang Xuning, Wang Siqi, Yang Hailong, et al. Large Model Inference Optimization Based on Adaptive Tensor Swapping and Recomputation [J]. Computer Engineering, 2025, 51(10): 27–36.
[24] Jin Y, Wang T, Lin H, et al. P/D-Serve: Serving disaggregated large language model at scale[J]. arXiv preprint arXiv:2408.08147, 2024.
[25] Patel P, Choukse E, Zhang C, et al. Splitwise: Efficient generative llm inference using phase splitting[C]//2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024: 118-132.
[26] Hu C, Huang H, Xu L, et al. Inference without interference: Disaggregate llm inference for mixed downstream workloads[J]. arXiv preprint arXiv:2401.11181, 2024.
[27] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[28] 蔡睿,葛军,孙哲,等.AI预训练大模型发展综述[J].小型微型计算机系统,2024,45(10):2327-2337.
Cai Rui, Ge Jun, Sun Zhe, et al. A Survey on the Development of AI Pre-trained Large Models[J]. Journal of Chinese Computer Systems, 2024, 45(10): 2327-2337.
[29] NVIDIA. FasterTransformer[EB/OL]. https://github.com/NVIDIA/FasterTransformer.
[30] Sidorov O, Hu R, Rohrbach M, et al. TextCaps: A dataset for image captioning with reading comprehension[C]//European conference on computer vision. Cham: Springer International Publishing, 2020: 742-758.
[31] Li Y, Du Y, Zhou K, et al. Evaluating object hallucination in large vision-language models[J]. arXiv preprint arXiv:2305.10355, 2023.
[32] Fu C, Chen P, Shen Y, et al. MME: A comprehensive evaluation benchmark for multimodal large language models[J]. arXiv preprint arXiv:2306.13394, 2023.
[33] Gurari D, Li Q, Stangl A J, et al. Vizwiz grand challenge: Answering visual questions from blind people[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 3608-3617.
[34] Singh A, Natarajan V, Shah M, et al. Towards vqa models that can read[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 8317-8326.
[35] Li F, Zhang R, Zhang H, et al. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models[J]. arXiv preprint arXiv:2407.07895, 2024.