
Computer Engineering (计算机工程)



XiRang: Scalable High-Performance Address Mapping Structure

  • Published: 2025-11-05


Abstract: Emerging applications in datacenters introduce substantial large-granularity RDMA communication demands. RDMA communicates via physical addresses, and when accessing large-granularity data, the Page Table Entries (PTEs) required for address translation exceed the cache capacity of the hardware device. Current high-performance commercial solutions store PTEs in host memory; under this architecture, however, a large-granularity transfer can proceed only after the PTEs have been fetched from host memory, which introduces PCIe traversal and host-memory access latency, severely degrading address translation efficiency and adding host CPU overhead. To achieve efficient large-granularity RDMA, this paper designs a configurable, high-performance address mapping structure: XiRang. XiRang extends the access granularity efficiently through a streaming prefetch mechanism and a hierarchical cache design, and achieves flexible, high-throughput address translation through a configurable address translation array. A XiRang prototype is implemented on a DPU.
Experiments show that: 1) XiRang effectively offloads the address translation load of the RDMA data plane, decoupling it from the host CPU; 2) XiRang's streaming prefetch extension mechanism effectively reduces storage overhead, with cache consumption at only the 10-byte level in concurrent mode, making the storage overhead of concurrency negligible; 3) under high numbers of concurrent memory access requests, XiRang maintains a translation-entry lookup hit rate close to 100% and reduces the idle time of the translation engine by two to three orders of magnitude compared with the RNIC architecture; 4) XiRang's translation throughput is more than 60 times that of the RNIC translation architecture and more than 3.5 times that of the baseline DPU address mapping structure; 5) in performance-enhancement mode, XiRang's address translation speed can sustain a data transfer bandwidth of 1.4 TB/s.
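The core idea behind the streaming-prefetch mechanism the abstract describes can be illustrated with a minimal sketch. This is not XiRang's actual design: the cache structure, `PREFETCH_RUN` burst size, and all names below are hypothetical, chosen only to show why prefetching a contiguous run of PTEs on each miss keeps the hit rate near 100% for the large sequential accesses typical of large-granularity RDMA.

```python
# Illustrative sketch (NOT XiRang's implementation): a tiny PTE cache that,
# on a miss, prefetches a contiguous run of page-table entries, mimicking
# how a streaming prefetcher exploits the sequential access pattern of
# large-granularity I/O. All parameters here are assumptions.

PAGE_SIZE = 4096          # bytes per page (assumed)
PREFETCH_RUN = 8          # PTEs fetched per miss (assumed burst size)

class StreamingPTECache:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical frame
        self.cache = {}                # small on-device PTE cache
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            # Miss: fetch this PTE plus the following entries in one burst,
            # anticipating that a large transfer will touch them next.
            for v in range(vpn, vpn + PREFETCH_RUN):
                if v in self.page_table:
                    self.cache[v] = self.page_table[v]
        return self.cache[vpn] * PAGE_SIZE + offset

# Toy page table over 64 pages for demonstration.
table = {v: v + 100 for v in range(64)}
tlb = StreamingPTECache(table)

# Sequentially touch 64 KiB (16 pages): only 1 in PREFETCH_RUN accesses
# misses, so the miss cost is amortized across the burst.
for addr in range(0, 16 * PAGE_SIZE, PAGE_SIZE):
    tlb.translate(addr)

print(tlb.hits, tlb.misses)  # prints "14 2": 2 bursts cover all 16 pages
```

With a larger burst size, the per-miss latency of fetching PTEs from host memory is paid once per run rather than once per page, which is the intuition behind sustaining a near-100% hit rate under sequential large-granularity access.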