
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 162-170. doi: 10.19678/j.issn.1000-3428.0063200

• Computer Architecture and Software Technology •

Design and Implementation of Domain-Specific Low-Latency and High-Bandwidth TCP/IP Offload Engine

FENG Yifei1, DING Nan2, YE Junchao2, CHAI Zhilei2,3   

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122, China;
    2. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China;
    3. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi, Jiangsu 214122, China
  • Received: 2021-11-10  Revised: 2021-12-13  Published: 2021-12-15

  • About the authors: FENG Yifei (b. 1997), male, master's student; his research interests include software-defined high-efficiency computer systems and hardware-software co-design. DING Nan and YE Junchao are master's students; CHAI Zhilei is a professor with a Ph.D.
  • Funding:
    National Natural Science Foundation of China (61972180).

Abstract: To meet the low-latency and high-bandwidth requirements of data transmission in quantitative high-frequency trading scenarios, a domain-specific Transmission Control Protocol/Internet Protocol (TCP/IP) stack is customized and offloaded to a dedicated hardware acceleration module. A modular design is adopted to implement the special-purpose hardware logic, which, together with a FAST protocol hardware acceleration module, forms a complete low-latency, high-bandwidth high-frequency trading system. By adjusting the Maximum Segment Size (MSS), 64-byte data alignment is achieved, improving the read/write speed between the kernel and High Bandwidth Memory (HBM); the memory structure is further optimized to provide 4-channel parallel read/write management between the host and the HBM. The data flow of each functional module, as well as the data used for verification and calculation, is optimized, yielding a fully pipelined architecture. The modules are connected through AXI4-Stream interfaces, bypassing memory during data transfer and improving transmission performance. The experimental results show that the TCP/IP offload engine achieves a network throughput of 38.28 Gb/s on a Xilinx Alveo U50 data center accelerator card, with a minimum end-to-end network communication latency of 468.4 ns, rising to 677.9 ns when FAST protocol decoding is added. Compared with a traditional software-processed network stack (Intel i9-9900x + 9802BF), the TCP/IP engine doubles the throughput, reduces the latency to 1/12, and keeps the latency stable within a fluctuation range of approximately 10 ns. It thus meets the needs of quantitative high-frequency trading scenarios while effectively reducing the load on the CPU.
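As a hedged illustration of the MSS-alignment idea described in the abstract (a sketch, not code from the paper), choosing an MSS that is a multiple of 64 bytes keeps every full TCP payload aligned to the 64-byte access unit assumed for the HBM/AXI data path. With a standard 1500-byte Ethernet MTU and 40 bytes of IPv4 + TCP headers (no options), the largest such MSS works out to 1408 bytes:

```python
# Hypothetical sketch: pick the largest MSS that is a multiple of the
# 64-byte alignment unit, so each full-sized TCP payload stays aligned
# with 64-byte HBM/AXI accesses. Constants are illustrative assumptions.
ETH_MTU = 1500   # standard Ethernet MTU (bytes)
HEADERS = 20 + 20  # IPv4 header + TCP header without options (bytes)
ALIGN = 64       # assumed HBM/AXI data-path alignment unit (bytes)

def aligned_mss(mtu: int = ETH_MTU, headers: int = HEADERS, align: int = ALIGN) -> int:
    """Largest MSS <= (mtu - headers) that is a multiple of `align`."""
    return ((mtu - headers) // align) * align

print(aligned_mss())  # 1408 = 22 * 64
```

Under these assumptions, the engine would advertise an MSS of 1408 rather than the usual 1460, trading a few percent of per-packet payload for alignment of every segment boundary.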

Key words: domain-specific, Transmission Control Protocol/Internet Protocol (TCP/IP) offload engine, low latency and high bandwidth, Field Programmable Gate Array (FPGA), Open Computing Language (OpenCL)

