作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2024, Vol. 50 ›› Issue (9): 208-215. doi: 10.19678/j.issn.1000-3428.0068304

• 体系结构与软件技术 • 上一篇    下一篇

Falcon后量子算法的密钥树生成部件GPU并行优化设计与实现

张磊1, 赵光岳2,*(), 肖超恩1, 王建新1   

  1. 1. 北京电子科技学院电子与通信工程系, 北京 100070
    2. 北京电子科技学院网络空间安全系, 北京 100070
  • 收稿日期:2023-08-28 出版日期:2024-09-15 发布日期:2024-04-09
  • 通讯作者: 赵光岳
  • 基金资助:
    国家重点研发计划(2017YFB0801803); 中央高校基本科研业务费资金(328202278); 中央高校基本科研业务费资金(328202261); 中央高校基本科研业务费资金(3282023005); 北京高等教育"本科教学改革创新项目"(202110018002); 北京电子科技学院一流学科建设项目(20210064Z0401)

GPU Parallel Optimal Design and Implementation of Key Tree Generation Components for Falcon Post-Quantum Algorithms

ZHANG Lei1, ZHAO Guangyue2,*(), XIAO Chaoen1, WANG Jianxin1   

  1. 1. Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
    2. Department of Cyberspace Security, Beijing Electronic Science and Technology Institute, Beijing 100070, China
  • Received:2023-08-28 Online:2024-09-15 Published:2024-04-09
  • Contact: ZHAO Guangyue

摘要:

近年来, 后量子密码算法因其具有抗量子攻击的特性成为安全领域的研究热点。基于格的Falcon数字签名算法是美国国家标准与技术研究所(NIST)公布的首批4个后量子密码标准算法之一。密钥树生成是Falcon算法的核心部件, 在实际运算中占用较多的时间和消耗较多的资源。为此, 提出一种基于图形处理器(GPU)的Falcon密钥树并行生成方案。该方案使用奇偶线程联合控制的单指令多线程(SIMT)并行模式和无中间变量的直接计算模式, 达到了提升速度和减少资源占用的目的。基于Python的CUDA平台进行了实验, 验证结果的正确性。实验结果表明, Falcon密钥树生成在RTX 3060 Laptop的延迟为6 ms, 吞吐量为167次/s, 在计算单个Falcon密钥树生成部件时相对于CPU实现了1.17倍的加速比, 在同时并行1 024个Falcon密钥树生成部件时, GPU相对于CPU的加速比达到了约56倍, 在嵌入式Jetson Xavier NX平台上的吞吐量为32次/s。

关键词: 后量子密码, Falcon算法, 图形处理器, CUDA平台, 并行计算

Abstract:

Recently, post-quantum cryptographic algorithms have become a popular research topic in the field of security owing to their resistance to quantum attacks. The lattice-based Falcon digital signature algorithm is one of the first four post-quantum cryptographic standard algorithms published by NIST. Key tree generation is the core component of the Falcon algorithm, which requires more time and consumes more resources during actual operation. Therefore, this study proposes a GPU-based parallel key tree generation scheme for Falcon that uses the Single Instruction Multiple Threads(SIMT) parallel mode with joint control of parity threads and the direct computation mode without intermediate variables to achieve speedup and reduce resource consumption. Experiments are conducted on a Python-based CUDA platform to verify the accuracy of the results. Falcon key tree generation for the RTX 3060 Laptop has a latency of 6 ms and a throughput rate of 167 times/s. It achieves a 1.17 acceleration ratio relative to the CPU when computing a single Falcon key tree generating part, whereas the GPU achieves an approximately 56 acceleration ratio relative to the CPU when 1 024 Falcon key tree generating parts are generated simultaneously; the throughput rate is 32 times/s on the embedded Jetson Xavier NX platform.

Key words: post-quantum cryptography, Falcon algorithm, Graphics Processing Unit (GPU), CUDA platform, parallel computing