Area-Efficient Polynomial Multiplication Hardware Implementation for Lattice-based Cryptography

Computer Engineering, 2026, Vol. 52, Issue 1: 282-292. doi: 10.19678/j.issn.1000-3428.0069229

XIE Jiaxing¹, PU Jinwei¹, FANG Weitian¹, ZHENG Xin²,*, XIONG Xiaoming²

1. School of Automation, Guangdong University of Technology, Guangzhou 510006, Guangdong, China
2. School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, Guangdong, China

Received: 2024-01-15 Revised: 2024-07-24 Online: 2026-01-15 Published: 2024-10-15
Contact: ZHENG Xin
About the authors:
XIE Jiaxing, male, master's student; main research interest: hardware acceleration of public-key cryptographic algorithms
PU Jinwei, master's student
FANG Weitian, master's student
ZHENG Xin (corresponding author), associate professor, Ph.D.
XIONG Xiaoming, professor, Ph.D.
Funding:
Guangdong Basic and Applied Basic Research Foundation (2021A1515110777); Key-Area Research and Development Program of Guangdong Province (2022B0701180001)

Abstract

Lattice-based post-quantum cryptographic algorithms have broad application prospects in public-key cryptography, and the computational complexity of polynomial multiplication is the main performance bottleneck of their hardware implementations. To address the low area efficiency and memory mapping conflicts of polynomial multiplication (PM) implementations, this study proposes a PM structure based on the Partial Number Theoretic Transform (PNTT) and a Coefficient Crossover Operation (CCO). First, the last round of the Number Theoretic Transform (NTT), the coefficient multiplication, and the first round of the Inverse Number Theoretic Transform (INTT) are merged into the CCO, which eliminates two rounds of butterfly operations, saves 50% of the twiddle factor storage, and lowers memory access overhead. Second, lightweight hardware implements modular addition, modular subtraction, division by two, and an optimized Barrett-based modular multiplication, reducing logic resource overhead; pipelining and time-division multiplexing are used to build a reconfigurable Processing Element (PE) array whose units can be efficiently reconnected for the different transforms. Third, coefficient-grouped storage and a special memory mapping method are introduced into the memory mapping scheme; by exploiting regularities in the address mapping, data and twiddle factors are scheduled efficiently, memory mapping conflicts are avoided, and memory access is achieved at low cost. Finally, a First-In First-Out (FIFO) structure reorganizes the data to improve data access efficiency. Experimental results show that, compared with existing related works, the proposed PM structure reduces the Area-Time Product (ATP) in terms of Slices and Digital Signal Processors (DSPs) by more than 21.7% and 61.1%, respectively, and thus achieves higher area efficiency.

Key words: lattice-based cryptography, polynomial multiplication, Number Theoretic Transform (NTT), modular reduction, conflict-free memory mapping
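
The abstract names the arithmetic building blocks (modular addition, modular subtraction, division by two, Barrett-based modular multiplication) and a transform that stops short of a full NTT, but it does not give the circuits or the exact CCO data flow. The Python sketch below is only an illustration under stated assumptions: it models the four operations in software and uses them in a recursive negacyclic multiplication that stops splitting at degree-1 residues, which is one standard way to avoid the last NTT layer and the first INTT layer (the same idea as the incomplete NTT in ML-KEM). All function names, the toy modulus q = 41, and the recursive structure are assumptions made for this example, not the authors' design.

```python
def barrett_params(q):
    """Precompute (k, mu) for Barrett reduction modulo q (hypothetical helper)."""
    k = q.bit_length()
    return k, (1 << (2 * k)) // q

def mod_add(a, b, q):
    s = a + b
    return s - q if s >= q else s

def mod_sub(a, b, q):
    d = a - b
    return d + q if d < 0 else d

def mod_half(a, q):
    """a / 2 mod q for odd q: add q to an odd operand, then shift right."""
    return (a + q) >> 1 if a & 1 else a >> 1

def barrett_mul(a, b, q, k, mu):
    """a * b mod q for 0 <= a, b < q, using a multiply by mu instead of a division."""
    z = a * b
    t = (z * mu) >> (2 * k)   # estimate of z // q, at most 1 too small
    r = z - t * q             # hence 0 <= r < 2q
    return r - q if r >= q else r

def find_root(n, q):
    """Smallest w with w^(n/2) = -1 mod q, i.e. a primitive n-th root of unity."""
    for w in range(2, q):
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("q has no primitive n-th root of unity")

def negacyclic_mul(a, b, q):
    """a * b in Z_q[x]/(x^n + 1); n = len(a) is a power of two, coefficients in [0, q)."""
    n = len(a)
    k, mu = barrett_params(q)
    w = find_root(n, q)

    def mul(x, y):
        return barrett_mul(x, y, q, k, mu)

    def rec(a, b, e):
        # Multiply in Z_q[x]/(x^m - w^e), where m = len(a).
        m = len(a)
        if m == 2:
            # Stop one layer early: multiply degree-1 residues mod (x^2 - c).
            c = pow(w, e, q)
            r0 = mod_add(mul(a[0], b[0]), mul(c, mul(a[1], b[1])), q)
            r1 = mod_add(mul(a[0], b[1]), mul(a[1], b[0]), q)
            return [r0, r1]
        h = m // 2
        s = pow(w, e // 2, q)          # s^2 = w^e
        s_inv = pow(w, n - e // 2, q)  # w has order n, so this is 1/s

        def split(p):
            # Forward butterflies: residues mod (x^h - s) and (x^h + s).
            t = [mul(s, p[i + h]) for i in range(h)]
            return ([mod_add(p[i], t[i], q) for i in range(h)],
                    [mod_sub(p[i], t[i], q) for i in range(h)])

        a_p, a_m = split(a)
        b_p, b_m = split(b)
        u = rec(a_p, b_p, e // 2)
        v = rec(a_m, b_m, e // 2 + n // 2)
        # Inverse butterflies with a halving per layer instead of a final 1/n scaling.
        lo = [mod_half(mod_add(u[i], v[i], q), q) for i in range(h)]
        hi = [mod_half(mul(mod_sub(u[i], v[i], q), s_inv), q) for i in range(h)]
        return lo + hi

    return rec(a, b, n // 2)           # x^n + 1 = x^n - w^(n/2)

def schoolbook(a, b, q):
    """Reference O(n^2) multiplication in Z_q[x]/(x^n + 1)."""
    n = len(a)
    r = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            r[(i + j) % n] = (r[(i + j) % n] + sign * a[i] * b[j]) % q
    return r

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
assert negacyclic_mul(a, b, 41) == schoolbook(a, b, 41)
```

With q = 41 and n = 8 the modulus has 8th roots of unity but no 16th roots (16 does not divide 40), so a complete length-8 negacyclic NTT is unavailable while the early-stopping variant still works. It also needs only powers of an n-th root rather than a 2n-th root, which is consistent with, though not proof of, the twiddle-factor saving the abstract reports.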

XIE Jiaxing, PU Jinwei, FANG Weitian, ZHENG Xin, XIONG Xiaoming. Area-Efficient Polynomial Multiplication Hardware Implementation for Lattice-based Cryptography[J]. Computer Engineering, 2026, 52(1): 282-292.

https://www.ecice06.com/CN/Y2026/V52/I1/282

Figures/Tables: 10

Fig.1 Data flow diagram of polynomial multiplication implemented with the butterfly structure (8-point)

Fig.2 Hardware architecture of the polynomial multiplication implementation

Fig.3 Three transformation structures of the reconfigurable PE array

Fig.4 Circuit structure of the module units

Fig.5 Coefficient memory mapping scheme (16-point)

Fig.6 Reordering unit
