SmartNIC-Offloaded Worker Node for Storage-Compute Disaggregated Recommendation System

doi:10.19678/j.issn.1000-3428.0069251

Abstract

Deep learning-based recommendation systems are widely used in industry to provide personalized recommendations. In the common storage-compute disaggregated inference architecture, inference speed is limited by the inter-node network transmission bottleneck caused by embedding-layer queries. Emerging SmartNIC technology enables complex traffic control without contending for the host central processing unit (CPU), offering new possibilities for optimizing the embedding layer in disaggregated recommendation systems. This study designs and implements SmartNIC-offloaded Worker Node (SmartWN), a compute node for storage-compute disaggregated recommendation systems optimized with a SmartNIC. By leveraging the SmartNIC's independent computing and communication capabilities, SmartWN implements embedding query reordering and pre-preparation, along with traffic-based dynamic cache management across multiple embedding tables, without affecting host resources on the compute node. This substantially improves the communication efficiency and cache utilization of embedding queries during inference, reduces embedding query latency, and improves the inference performance of the disaggregated recommendation system. A SmartWN prototype is implemented on an NVIDIA BlueField-2 SmartNIC and its performance gains are verified. Compared with existing techniques, using SmartWN as the compute node improves embedding-layer query throughput during inference by up to 2.13x and reduces embedding query tail latency by approximately 50.6%.

Key words: SmartNIC, recommendation system, storage-compute disaggregation, cache management, embedding lookup, performance optimization

SHI Ruixin, YAN Ming, WU Jie. SmartNIC-Offloaded Worker Node for Storage-Compute Disaggregated Recommendation System[J]. Computer Engineering, 2026, 52(3): 264-275.

https://www.ecice06.com/CN/Y2026/V52/I3/264

Figures and tables (16)

Fig.1 DLRM compute structure

Fig.2 Compute process of a storage-compute disaggregated recommendation system

Fig.3 Distribution of one million index queries across four embedding tables in the Criteo-Kaggle dataset

Fig.4 Structure of SmartNIC

Fig.5 Structure and dataflow of SmartWN

Fig.6 Design of SmartReorder

Fig.7 SmartWN-nic cache management model

Fig.8 Comparison of inference throughput

Fig.9 Comparison of P99 latency of embedding layers

Fig.10 Comparison of query cache hit rate of embedding layers

Fig.11 Comparison of inference latency between SmartWN and CpuWN with different MLPs

Fig.12 Normalized CPU usage of worker node host for each system

Fig.13 Maximum latency of a single inference task

