
Computer Engineering

   

Parallel Optimal Design and Implementation of Key Tree Generation Components for Falcon Post-Quantum Algorithms on GPU

  

  • Online: 2024-04-09 Published: 2024-04-09


Abstract: In recent years, post-quantum cryptographic algorithms have become an active research topic in security because they resist quantum attacks. The lattice-based Falcon digital signature algorithm is one of the first four post-quantum cryptographic algorithms that NIST announced for standardization. Key tree generation is a core component of Falcon and is comparatively costly in both time and resources in practice. This paper therefore proposes a GPU-based parallel key tree generation scheme for Falcon. The scheme combines a SIMT parallel mode, in which odd and even threads are jointly controlled, with a direct computation mode that avoids intermediate variables, which increases speed and reduces resource usage. Experiments on a Python-based CUDA platform verify the correctness of the results. On an RTX 3060 laptop GPU, Falcon key tree generation has a latency of 6 ms and a throughput of 167 operations/s. This is a 1.17x speedup over the CPU when a single Falcon tree generation component is computed, and a speedup of approximately 56x when 1024 Falcon tree generation components are computed in parallel. On the embedded Jetson Xavier NX platform, the throughput is 32 operations/s.

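The performance figures in the abstract are related by simple arithmetic, which the following minimal Python sketch makes explicit. It assumes that key trees are generated one at a time, so that throughput is the reciprocal of latency, and that speedup is defined as baseline time divided by accelerated time. The function and variable names are hypothetical, and the "implied" values are derived from the reported numbers under these assumptions; they are not measurements from the paper.

```python
def throughput_per_second(latency_ms: float) -> float:
    """Operations per second when operations run one at a time."""
    return 1000.0 / latency_ms


def latency_ms_from_throughput(ops_per_second: float) -> float:
    """Inverse relation, under the same one-at-a-time assumption."""
    return 1000.0 / ops_per_second


def baseline_latency_ms(accelerated_latency_ms: float, speedup: float) -> float:
    """Baseline latency implied by speedup = baseline / accelerated."""
    return accelerated_latency_ms * speedup


# Figures reported in the abstract (RTX 3060 laptop, single key tree).
gpu_latency_ms = 6.0
single_tree_speedup = 1.17
jetson_throughput = 32.0  # reported throughput on Jetson Xavier NX

# Derived values: assumptions of this sketch, not results from the paper.
gpu_throughput = throughput_per_second(gpu_latency_ms)
implied_cpu_latency_ms = baseline_latency_ms(gpu_latency_ms, single_tree_speedup)
implied_jetson_latency_ms = latency_ms_from_throughput(jetson_throughput)

print(f"GPU throughput: {gpu_throughput:.0f} key trees/s")
print(f"Implied CPU latency per key tree: {implied_cpu_latency_ms:.2f} ms")
print(f"Implied Jetson latency per key tree: {implied_jetson_latency_ms:.2f} ms")
```

The first derived value rounds to the reported 167 operations/s, which suggests that the reported throughput corresponds to sequential single-tree generation rather than to the batched 1024-tree case.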