GPU Parallel Optimal Design and Implementation of Key Tree Generation Components for Falcon Post-Quantum Algorithms

doi:10.19678/j.issn.1000-3428.0068304

Abstract

In recent years, post-quantum cryptographic algorithms have become a research focus in the security field because they resist quantum attacks. The lattice-based Falcon digital signature algorithm is one of the first four post-quantum cryptographic standard algorithms announced by the U.S. National Institute of Standards and Technology (NIST). Key tree generation is the core component of Falcon and accounts for a large share of its running time and resource use in practice. This study therefore proposes a scheme for generating the Falcon key tree in parallel on a Graphics Processing Unit (GPU). The scheme combines a Single Instruction Multiple Threads (SIMT) parallel mode, in which odd and even threads are jointly controlled, with a direct computation mode that uses no intermediate variables, which raises speed and lowers resource consumption. Experiments were run on a Python-based CUDA platform, and the results were verified for correctness. On an RTX 3060 Laptop GPU, Falcon key tree generation has a latency of 6 ms and a throughput of 167 generations/s. For a single Falcon key tree generation component, the GPU achieves a speedup of 1.17 times over the CPU; when 1024 key tree generation components run in parallel, the speedup over the CPU reaches approximately 56 times. On the embedded Jetson Xavier NX platform, the throughput is 32 generations/s.

Key words: post-quantum cryptography, Falcon algorithm, Graphics Processing Unit (GPU), CUDA platform, parallel computing
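The latency and throughput figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below assumes that key trees are generated one after another with no overlap, so that throughput is the reciprocal of latency; the function names are illustrative and do not come from the paper.

```python
def throughput_per_second(latency_ms):
    """Generations per second if each generation takes latency_ms and runs back to back."""
    return 1000.0 / latency_ms


def latency_ms(throughput):
    """Per-generation latency in milliseconds implied by a sequential throughput."""
    return 1000.0 / throughput


# RTX 3060 Laptop: 6 ms latency reported in the abstract.
print(round(throughput_per_second(6)))  # 167

# Jetson Xavier NX: 32 generations/s reported in the abstract.
# Under the same no-overlap assumption this implies about 31.25 ms per generation.
print(latency_ms(32))  # 31.25
```

Under this assumption, the 6 ms latency and the 167 generations/s throughput reported for the RTX 3060 Laptop are consistent with each other. The Jetson latency computed here is an inference under that assumption, not a figure stated in the abstract.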

ZHANG Lei, ZHAO Guangyue, XIAO Chaoen, WANG Jianxin. GPU Parallel Optimal Design and Implementation of Key Tree Generation Components for Falcon Post-Quantum Algorithms[J]. Computer Engineering, 2024, 50(9): 208-215.

https://www.ecice06.com/CN/Y2024/V50/I9/208

Figures/Tables (14)

Fig. 1 Coarse-grained and fine-grained parallelism

Fig. 2 Complete GPU parallel scheme for post-quantum algorithms

Fig. 3 Key procedure of Falcon key generation

Fig. 4 Recursive process of the ffLDL* module

Fig. 5 Parallel design of the ffLDL* module

Fig. 6 Refined output form of the Falcon key tree

Fig. 7 GPU key tree results from this paper (partial)

Fig. 8 Key tree results from the Falcon reference implementation (partial)

Fig. 9 Performance test of the ffLDL* module

Fig. 10 GPU-CPU speedup for different numbers of Falcon key tree generation components run in coarse-grained parallel

Fig. 11 GPU running latency for different numbers of Falcon key tree generation components run in coarse-grained parallel
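The ffLDL* module named in the captions above builds the Falcon key tree by repeatedly applying an LDL* decomposition to a 2x2 Gram matrix held in FFT form. The sketch below shows only that single decomposition step, following the LDL* formulas in the Falcon specification (L10 = G10 / G00, D00 = G00, D11 = G11 - L10 * conj(L10) * G00). It is not the paper's GPU kernel: the function name is hypothetical, the recursive splitting is omitted, and a plain Python loop stands in for the threads. Each loop iteration touches one FFT coefficient and is independent of the others, which is why this step suits a one-thread-per-coefficient layout.

```python
def ldl_pointwise(g00, g10, g11):
    """One LDL* step on a self-adjoint 2x2 Gram matrix given coefficient-wise in FFT form.

    g00, g10, g11 are equal-length lists of complex numbers; g01 is implied as conj(g10).
    Returns (l10, d00, d11) as lists of complex numbers.
    """
    l10, d00, d11 = [], [], []
    for a, b, c in zip(g00, g10, g11):  # iterations are independent of each other
        l = b / a                        # L10 = G10 / G00
        l10.append(l)
        d00.append(a)                    # D00 = G00
        d11.append(c - l * l.conjugate() * a)  # D11 = G11 - L10 * conj(L10) * G00
    return l10, d00, d11


# Small hand-checkable input with two FFT coefficients (illustrative values only).
g00 = [2 + 0j, 4 + 0j]
g10 = [1 + 1j, 2 - 2j]
g11 = [3 + 0j, 5 + 0j]
l10, d00, d11 = ldl_pointwise(g00, g10, g11)

# Recombining L * D * L^* should give back the original Gram entries.
for i in range(len(g00)):
    rebuilt_g10 = l10[i] * d00[i]
    rebuilt_g11 = l10[i] * d00[i] * l10[i].conjugate() + d11[i]
    assert abs(rebuilt_g10 - g10[i]) < 1e-12
    assert abs(rebuilt_g11 - g11[i]) < 1e-12
```

In the full ffLDL* recursion, D00 and D11 would each be split into half-size polynomials and decomposed again until the leaves are reached; that part depends on the FFT root ordering and is left out of this sketch.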

