A Load-Aware Optimization Method for Mixture-of-Experts Networks on Edge FPGAs

doi:10.19678/j.issn.1000-3428.0260386

Abstract

Deploying Mixture-of-Experts (MoE) networks on resource-constrained edge FPGAs runs into a severe memory wall and load imbalance across experts. Existing dynamic-scheduling and batch-processing solutions struggle to meet the strict real-time requirements of streaming inference. To address these issues, a load-aware hardware-software co-optimization method is proposed. Exploiting the long-tail distribution of expert activations, a Probability-Aware Static Locking (PASL) strategy uses a hierarchical storage mechanism to minimize memory access latency under limited capacity. In parallel, a statistics-driven automated Design Space Exploration (DSE) engine is constructed to find the optimal non-uniform allocation of computational resources. Furthermore, to handle the macro-level distribution drift prevalent in real-world edge scenarios, a load-evolution-oriented hysteretic hardware-software co-reconfiguration mechanism is proposed, which filters out micro-semantic noise and prevents cache thrashing. Experimental results show that in single-frame streaming inference, the proposed method improves throughput by up to 2.22× over the uniform allocation strategy and by up to 1.52× over the state-of-the-art Edge-MoE solution. Its energy efficiency is up to 2.9× and 3.1× that of the CPU and GPU baselines, respectively, and its end-to-end latency is as low as 16.33 ms when processing complex Vision Transformers. Under dynamic distribution drift, the proposed mechanism delivers a 17.3% throughput improvement over the static baseline while adding zero overhead in steady-state random scenarios. The approach thereby addresses the bottlenecks of real-time performance, energy efficiency, and adaptability to dynamic environments in edge MoE deployments.
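The abstract does not give the algorithms behind PASL or the hysteretic reconfiguration mechanism, so the following is only a minimal Python sketch of the two ideas as described: locking the most frequently activated experts into a limited fast tier, and re-locking only when a recent activation window shows a gain above a threshold. All function names, the hit-rate criterion, and the `margin` value are hypothetical and are not taken from the paper.

```python
def select_locked_experts(activation_counts, capacity):
    """Pick the `capacity` most frequently activated experts (ties broken by id).

    Assumption: lock-in is decided purely by profiled activation frequency.
    """
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    return set(ranked[:capacity])


def hit_rate(activation_counts, locked):
    """Fraction of activations served by experts in the locked set."""
    total = sum(activation_counts.values())
    if total == 0:
        return 0.0
    hits = sum(c for e, c in activation_counts.items() if e in locked)
    return hits / total


def should_reconfigure(current_locked, window_counts, capacity, margin=0.10):
    """Hysteresis check: re-lock only if the best set for the recent window
    beats the current set's hit rate by more than `margin` (assumed value)."""
    candidate = select_locked_experts(window_counts, capacity)
    gain = hit_rate(window_counts, candidate) - hit_rate(window_counts, current_locked)
    return gain > margin, candidate


# Offline profile with a long-tail shape (illustrative numbers only).
profile = {0: 50, 1: 30, 2: 10, 3: 6, 4: 4}
locked = select_locked_experts(profile, capacity=2)

# Small fluctuation: the same experts still dominate, so nothing changes.
noisy_window = {0: 45, 1: 28, 2: 15, 3: 7, 4: 5}
noisy_decision, _ = should_reconfigure(locked, noisy_window, capacity=2)

# Sustained shift towards experts 2 and 3: the gain exceeds the margin.
drift_window = {0: 5, 1: 10, 2: 40, 3: 35, 4: 10}
drift_decision, new_locked = should_reconfigure(locked, drift_window, capacity=2)
```

In this toy setup the noisy window leaves the locked set unchanged, while the drifted window triggers a re-lock. The margin is what keeps small fluctuations from causing repeated swaps of the locked set.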


LI Bo, LIU Shouwen, YUAN Mengting. A Load-Aware Optimization Method for Mixture-of-Experts Networks on Edge FPGAs[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260386.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260386

