
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 162-170. doi: 10.19678/j.issn.1000-3428.0063200

• Computer Architecture and Software Technology •

Design and Implementation of Domain-Specific Low-Latency and High-Bandwidth TCP/IP Offload Engine

FENG Yifei1, DING Nan2, YE Junchao2, CHAI Zhilei2,3   

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China;
    2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China;
    3. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi, Jiangsu 214122, China
  • Received: 2021-11-10  Revised: 2021-12-13  Published: 2021-12-15

  • About the authors: FENG Yifei (b. 1997), male, master's student; his research interests include software-defined high-efficiency computer systems and hardware-software co-design. DING Nan and YE Junchao are master's students; CHAI Zhilei is a professor with a Ph.D.
  • Funding:
    National Natural Science Foundation of China (61972180).

Abstract: To meet the low-latency and high-bandwidth requirements of data transmission in quantitative high-frequency trading scenarios, a domain-specific Transmission Control Protocol/Internet Protocol (TCP/IP) stack is customized and offloaded to a dedicated hardware acceleration module. A modular design is adopted to implement the special-purpose hardware logic, which, together with a FAST protocol hardware acceleration module, forms a complete low-latency, high-bandwidth high-frequency trading system. By adjusting the Maximum Segment Size (MSS), 64-byte data alignment is achieved, improving the read/write speed between the kernel and High Bandwidth Memory (HBM); the memory structure is further optimized to provide 4-channel parallel read/write management between the host and the HBM. The data flow of each functional module, as well as the data used for verification and calculation, is optimized, yielding a fully pipelined architecture. The modules are connected through AXI4-Stream interfaces, bypassing memory during data transfer and improving transmission performance. The experimental results show that the TCP/IP offload engine achieves a network throughput of 38.28 Gb/s on a Xilinx Alveo U50 data center accelerator card, with a minimum end-to-end network communication latency of 468.4 ns, rising to 677.9 ns when FAST protocol decoding is added. Compared with a traditional software-processed network stack (Intel i9-9900x + 9802BF), the TCP/IP engine doubles the throughput, reduces the latency to 1/12, and keeps the latency stable within a fluctuation range of approximately 10 ns. It thus meets the needs of quantitative high-frequency trading scenarios while effectively reducing the load on the CPU.
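As a hedged illustration of the MSS-alignment idea described in the abstract (a sketch, not code from the paper), choosing an MSS that is a multiple of 64 bytes keeps every full TCP payload aligned to the 64-byte access unit assumed for the HBM/AXI data path. With a standard 1500-byte Ethernet MTU and 40 bytes of IPv4 + TCP headers (no options), the largest such MSS works out to 1408 bytes:

```python
# Hypothetical sketch: pick the largest MSS that is a multiple of the
# 64-byte alignment unit, so each full-sized TCP payload stays aligned
# with 64-byte HBM/AXI accesses. Constants are illustrative assumptions.
ETH_MTU = 1500   # standard Ethernet MTU (bytes)
HEADERS = 20 + 20  # IPv4 header + TCP header without options (bytes)
ALIGN = 64       # assumed HBM/AXI data-path alignment unit (bytes)

def aligned_mss(mtu: int = ETH_MTU, headers: int = HEADERS, align: int = ALIGN) -> int:
    """Largest MSS <= (mtu - headers) that is a multiple of `align`."""
    return ((mtu - headers) // align) * align

print(aligned_mss())  # 1408 = 22 * 64
```

Under these assumptions, the engine would advertise an MSS of 1408 rather than the usual 1460, trading a few percent of per-packet payload for alignment of every segment boundary.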

Key words: domain-specific, Transmission Control Protocol/Internet Protocol (TCP/IP) offload engine, low latency and high bandwidth, Field Programmable Gate Array (FPGA), Open Computing Language (OpenCL)

