
Computer Engineering

   

A Load-Aware Optimization Method for Mixture-of-Experts Networks on Edge FPGAs

  

Published: 2026-06-11


Abstract: Deploying Mixture-of-Experts (MoE) networks on resource-constrained edge FPGAs runs into a severe memory wall and load imbalance across experts. Existing dynamic-scheduling and batch-processing solutions struggle to meet the strict real-time requirements of streaming inference. To address these issues, a load-aware hardware-software co-optimization method is proposed. Exploiting the long-tail distribution of expert activations, a Probability-Aware Static Locking (PASL) strategy uses a hierarchical storage mechanism to minimize memory access latency under limited on-chip capacity. In parallel, a statistics-driven, automated Design Space Exploration (DSE) engine allocates computational resources non-uniformly across experts to find the best-fitting configuration. To handle the macro-level distribution drift that is common in real-world edge scenarios, a load-evolution-oriented hysteretic hardware-software co-reconfiguration mechanism is also proposed; it filters out micro-level semantic noise and prevents cache thrashing. Experimental results show that, in single-frame streaming inference, the proposed method improves throughput by up to 2.22× over a uniform allocation strategy and by up to 1.52× over the state-of-the-art Edge-MoE solution. Across the evaluated tasks, its energy efficiency reaches up to 2.9× that of the CPU baseline and up to 3.1× that of the GPU baseline, and its end-to-end latency is as low as 16.33 ms when processing complex Vision Transformers. Under dynamic distribution drift, the mechanism delivers a 17.3% throughput improvement over the static baseline while adding zero overhead in steady-state random scenarios. The approach thereby addresses the real-time performance, energy efficiency, and dynamic-environment adaptability bottlenecks of MoE deployment at the edge.

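The abstract names two mechanisms without giving their details: locking frequently activated experts into limited fast memory, and reconfiguring with hysteresis so that short-lived noise does not cause thrashing. The Python sketch below is only an illustration of those two ideas under simplifying assumptions, not the paper's implementation. All names (`select_locked_experts`, `HystereticReconfigurator`, `hit_rate_floor`, `patience`) are hypothetical, experts are assumed to be equal in size, and the trigger rule (hit rate below a floor for several consecutive windows) is an assumed stand-in for whatever criterion the authors actually use.

```python
def select_locked_experts(activation_counts, expert_size_kb, capacity_kb):
    """Pick the most frequently activated experts that fit in fast memory.

    Assumption: every expert occupies the same expert_size_kb, so the
    choice reduces to taking the top-k experts by activation count.
    """
    # Sort by descending count; break ties by expert id for determinism.
    ranked = sorted(activation_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    max_locked = capacity_kb // expert_size_kb
    return {expert for expert, _ in ranked[:max_locked]}


class HystereticReconfigurator:
    """Re-lock experts only after the hit rate stays low for several windows.

    A single bad window (micro-level noise) is ignored; only a sustained
    drop (macro-level drift) triggers a new locking decision.
    """

    def __init__(self, locked, hit_rate_floor=0.6, patience=3):
        self.locked = set(locked)
        self.hit_rate_floor = hit_rate_floor  # hypothetical threshold
        self.patience = patience              # consecutive bad windows needed
        self.bad_windows = 0

    def observe_window(self, window_counts, expert_size_kb, capacity_kb):
        """Feed one window of activation counts; return True if re-locked."""
        total = sum(window_counts.values())
        hits = sum(c for e, c in window_counts.items() if e in self.locked)
        hit_rate = hits / total if total else 1.0

        if hit_rate < self.hit_rate_floor:
            self.bad_windows += 1
        else:
            self.bad_windows = 0  # a good window resets the streak

        if self.bad_windows >= self.patience:
            self.locked = select_locked_experts(
                window_counts, expert_size_kb, capacity_kb
            )
            self.bad_windows = 0
            return True
        return False


if __name__ == "__main__":
    # Long-tail profile: two experts receive most of the activations.
    profile = {0: 50, 1: 30, 2: 10, 3: 5, 4: 5}
    locked = select_locked_experts(profile, expert_size_kb=100, capacity_kb=250)
    controller = HystereticReconfigurator(locked, hit_rate_floor=0.6, patience=2)

    drifted = {0: 20, 2: 40, 3: 40}
    for step in range(2):
        changed = controller.observe_window(drifted, 100, 250)
        print(step, changed, sorted(controller.locked))
```

The `patience` counter is what makes the controller hysteretic in this sketch: in a steady state the locked set never changes, so no reconfiguration work is done, while a sustained shift in the activation profile eventually replaces it.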