[1]葛旭冉,欧洋,王博,等. 大语言模型推理中的存储优化技术综述 [J]. 计算机研究与发展, 2025, 62(3): 545-562.
GE Xuran, OU Yang, WANG Bo, et al. A survey of memory optimization techniques for large language model inference [J]. Journal of Computer Research and Development, 2025, 62(3): 545-562.
[2]Li H, Li Y, Tian A, et al. A Survey on Large Language Model Acceleration based on KV Cache Management[J]. ArXiv, 2024, abs/2412.19442. DOI:10.48550/arXiv.2412.19442.
[3]Zhu X, Li J, Liu Y, et al. A Survey on Model Compression for Large Language Models[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 1556-1577.
[4]Naveed H, Khan A U, Qiu S, et al. A Comprehensive Overview of Large Language Models[J]. ArXiv, 2023, abs/2307.06435. DOI:10.48550/arXiv.2307.06435.
[5]Wan Z, Shen H, Wang X, et al. MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference[J]. ArXiv, 2025, abs/2502.17599. DOI:10.48550/arXiv.2502.17599.
[6]Dong H, Yang X, Zhang Z, et al. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference[C]//Forty-first International Conference on Machine Learning, ICML 2024. Vienna, Austria: OpenReview.net, 2024. DOI:10.48550/arXiv.2402.09398.
[7]Han C, Wang Q, Peng H, et al. LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models[C]//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, 2024: 3991-4008.
[8]Xiao G, Tian Y, Chen B, et al. Efficient Streaming Language Models with Attention Sinks[C]//The Twelfth International Conference on Learning Representations, ICLR 2024. Vienna, Austria: OpenReview.net, 2024. DOI:10.48550/arXiv.2309.17453.
[9]Cai Z, Zhang Y, Gao B, et al. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling[J]. ArXiv, 2024, abs/2406.02069. DOI:10.48550/arXiv.2406.02069.
[10]Zhang Z, Sheng Y, Zhou T, et al. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models[C]//Advances in Neural Information Processing Systems 36 (NeurIPS 2023). 2023: 34661-34710. DOI:10.48550/arXiv.2306.14048.
[11]Yang S, Sheng Y, Gonzalez J E, et al. Post-Training Sparse Attention with Double Sparsity[J]. ArXiv, 2024, abs/2408.07092. DOI:10.48550/arXiv.2408.07092.
[12]Aminabadi R Y, Rajbhandari S, Awan A A, et al. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale[C]//SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. Dallas, TX, USA: IEEE, 2022: 1-15.
[13]Ribar L, Chelombiev I, Hudlass-Galley L, et al. SparQ Attention: Bandwidth-Efficient LLM Inference[C]//Forty-first International Conference on Machine Learning, ICML 2024. Vienna, Austria: OpenReview.net, 2024.
[14]曹义魁,陆忠华,张鉴,等. 面向国产加速器的CFD核心算法并行优化 [J]. 数据与计算发展前沿, 2021, 3(4): 93-103. DOI:10.11871/jfdc.issn.2096-742X.2021.04.008.
CAO Yikui, LU Zhonghua, ZHANG Jian, et al. Parallel optimization of CFD core algorithms based on domestic processor[J]. Frontiers of Data and Computing, 2021, 3(4): 93-103.
[15]赵文龙, 王武. Gadget-2在一个加速卡异构平台上的移植与优化[J]. 数据与计算发展前沿, 2022, 4(5): 108-119.
ZHAO Wenlong, WANG Wu. Porting and Optimizing Gadget-2 on a Heterogeneous Accelerator Platform[J]. Frontiers of Data and Computing, 2022, 4(5): 108-119.
[16]Zheng L, Yin L, Xie Z, et al. SGLang: Efficient Execution of Structured Language Model Programs[C]//Advances in Neural Information Processing Systems 37 (NeurIPS 2024). 2024: 62557-62583.
[17]Chen L, Zhao H, Liu T, et al. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 19-35.
[18]Qin Z, Cao Y, Lin M, et al. CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences[J]. ArXiv, 2025, abs/2503.12491. DOI:10.48550/arXiv.2503.12491.
[19]Williams S, Waterman A, Patterson D. Roofline: an insightful visual performance model for multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65-76.
[20]Agrawal A, Panwar A, Mohan J, et al. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills[J]. ArXiv, 2023, abs/2308.16369. DOI:10.48550/arXiv.2308.16369.
[21]Zhong Y, Liu S, Chen J, et al. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving[C]//18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). Santa Clara, CA: USENIX Association, 2024: 193-210.
[22]Touvron H, Martin L, Stone K, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models[J]. ArXiv, 2023, abs/2307.09288. DOI:10.48550/arXiv.2307.09288.
[23]Grattafiori A, Dubey A, Jauhri A, et al. The Llama 3 Herd of Models[J]. ArXiv, 2024, abs/2407.21783. DOI:10.48550/arXiv.2407.21783.
[24]Dao T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning[C]//The Twelfth International Conference on Learning Representations, ICLR 2024. Vienna, Austria: OpenReview.net, 2024.
[25]Dao T, Haziza D, Massa F, et al. Flash-decoding for Long-context Inference[EB/OL]. (2023-10-12)[2025-05-10]. https://crfm.stanford.edu/2023/10/12/flashdecoding.html.
[26]Hong K, Dai G, Xu J, et al. FlashDecoding++: Faster Large Language Model Inference on GPUs[J]. ArXiv, 2023, abs/2311.01282. DOI:10.48550/arXiv.2311.01282.
[27]Kwon W, Li Z, Zhuang S, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention[C]//Proceedings of the 29th Symposium on Operating Systems Principles. New York, NY, USA: Association for Computing Machinery, 2023: 611-626. DOI:10.48550/arXiv.2309.06180.
[28]Lin Y, Tang H, Yang S, et al. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving[J]. ArXiv, 2024, abs/2405.04532. DOI:10.48550/arXiv.2405.04532.