
Computer Engineering (计算机工程)



XiRang: Scalable High-Performance Address Mapping Structure

  • Published: 2025-11-05


Abstract: Emerging applications in datacenters introduce substantial large-granularity RDMA communication demands. RDMA communicates via physical addresses, and when accessing large-granularity data, the Page Table Entries (PTEs) required for address translation exceed the cache capacity of the hardware device. Current high-performance commercial solutions store PTEs in host memory; under this architecture, however, a large-granularity transfer can proceed only after the PTEs have been fetched from host memory, which introduces PCIe traversal and host-memory access latency, severely degrading address translation efficiency and adding host CPU overhead. To achieve efficient large-granularity RDMA, this paper designs a configurable, high-performance address mapping structure: XiRang. XiRang extends the access granularity efficiently through a streaming prefetch mechanism and a hierarchical cache design, and achieves flexible, high-throughput address translation through a configurable address translation array. A XiRang prototype is implemented on a DPU.
Experiments show that: 1) XiRang effectively offloads the address translation load of the RDMA data plane, decoupling it from the host CPU; 2) XiRang's streaming prefetch extension mechanism effectively reduces storage overhead, with cache consumption at only the 10-byte level in concurrent mode, making the storage overhead of concurrency negligible; 3) under high numbers of concurrent memory access requests, XiRang maintains a translation-entry lookup hit rate close to 100% and reduces the idle time of the translation engine by two to three orders of magnitude compared with the RNIC architecture; 4) XiRang's translation throughput is more than 60 times that of the RNIC translation architecture and more than 3.5 times that of the baseline DPU address mapping structure; 5) in performance-enhancement mode, XiRang's address translation speed can sustain a data transfer bandwidth of 1.4 TB/s.
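The core idea behind the streaming-prefetch mechanism the abstract describes can be illustrated with a minimal sketch. This is not XiRang's actual design: the cache structure, `PREFETCH_RUN` burst size, and all names below are hypothetical, chosen only to show why prefetching a contiguous run of PTEs on each miss keeps the hit rate near 100% for the large sequential accesses typical of large-granularity RDMA.

```python
# Illustrative sketch (NOT XiRang's implementation): a tiny PTE cache that,
# on a miss, prefetches a contiguous run of page-table entries, mimicking
# how a streaming prefetcher exploits the sequential access pattern of
# large-granularity I/O. All parameters here are assumptions.

PAGE_SIZE = 4096          # bytes per page (assumed)
PREFETCH_RUN = 8          # PTEs fetched per miss (assumed burst size)

class StreamingPTECache:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical frame
        self.cache = {}                # small on-device PTE cache
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            # Miss: fetch this PTE plus the following entries in one burst,
            # anticipating that a large transfer will touch them next.
            for v in range(vpn, vpn + PREFETCH_RUN):
                if v in self.page_table:
                    self.cache[v] = self.page_table[v]
        return self.cache[vpn] * PAGE_SIZE + offset

# Toy page table over 64 pages for demonstration.
table = {v: v + 100 for v in range(64)}
tlb = StreamingPTECache(table)

# Sequentially touch 64 KiB (16 pages): only 1 in PREFETCH_RUN accesses
# misses, so the miss cost is amortized across the burst.
for addr in range(0, 16 * PAGE_SIZE, PAGE_SIZE):
    tlb.translate(addr)

print(tlb.hits, tlb.misses)  # prints "14 2": 2 bursts cover all 16 pages
```

With a larger burst size, the per-miss latency of fetching PTEs from host memory is paid once per run rather than once per page, which is the intuition behind sustaining a near-100% hit rate under sequential large-granularity access.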