
Computer Engineering (计算机工程)


Research and Implementation of Reconfigurable Cryptographic Processors

  • Published: 2026-01-09

Abstract: To address the demand for multi-algorithm adaptability, a reconfigurable application-specific instruction set processor is designed to efficiently support block ciphers and hash functions. The architecture adopts a Very Long Instruction Word (VLIW) structure, combined with symmetric clustered execution units and a cross-cluster register access mechanism, enabling parallel processing of logic operations, shifts, and table lookups. In the instruction set design, fused logic-lookup instructions, multi-mode shift instructions, and vector operations are introduced to reduce pipeline stalls and increase instruction density. The pipeline is organized into three stages (fetch, decode, and execute), and a bypass mechanism is employed to resolve data hazards and shorten the critical path. In algorithm mapping and optimization, the block ciphers SM4 and AES use T-box lookups and four-cluster parallel scheduling, reducing each round to 4 and 7 cycles, respectively; the hash functions SHA-256 and SM3 use multi-mode shift and fused Boolean logic instructions, taking 8 cycles per round; SHA-3 is mapped through a three-phase strategy that reorganizes its five steps into three pipelined stages, substantially reducing dependency-induced stalls. For hardware implementation, the design is synthesized on a Xilinx Kintex-7 FPGA (XC7K325TFFG676-2), where it consumes 11,105 look-up tables (LUTs), 1,564 flip-flops (FFs), and 25 block RAMs (BRAMs) and runs at a clock frequency of 125 MHz. Under these conditions, the processor achieves throughputs of 125 Mbps for SM4, 228.6 Mbps for AES, 125 Mbps for SHA-256, 125 Mbps for SM3, and 75.6 Mbps for SHA-3. The experimental results show that the architecture achieves unified acceleration of multiple algorithms with low resource overhead, outperforming general-purpose processor extension schemes while offering good flexibility and scalability.
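As a sanity check on the figures above, the reported throughputs for SM4, AES, SHA-256, and SM3 appear consistent with a simple model: block size times clock frequency, divided by rounds times cycles per round. The sketch below is only an illustration under stated assumptions. The block sizes and round counts are taken from the public algorithm standards rather than from the abstract, AES is assumed to be AES-128, and key scheduling, padding, and I/O overhead are ignored. SHA-3 is left out because the abstract does not give its cycles per round. The function name `throughput_mbps` is hypothetical.

```python
# Assumptions (not stated in the abstract): block sizes and round counts
# come from the public standards, AES means AES-128, and any key-schedule,
# padding, or load/store overhead is ignored.
CLOCK_MHZ = 125.0  # clock frequency reported in the abstract

ALGORITHMS = {
    # name: (block_bits, rounds, cycles_per_round)
    "SM4": (128, 32, 4),
    "AES-128": (128, 10, 7),
    "SHA-256": (512, 64, 8),
    "SM3": (512, 64, 8),
}


def throughput_mbps(block_bits, rounds, cycles_per_round, clock_mhz=CLOCK_MHZ):
    """Return Mbps: bits per block times MHz, divided by cycles per block."""
    cycles_per_block = rounds * cycles_per_round
    return block_bits * clock_mhz / cycles_per_block


for name, params in ALGORITHMS.items():
    print(f"{name}: {throughput_mbps(*params):.1f} Mbps")
# SM4: 125.0 Mbps
# AES-128: 228.6 Mbps
# SHA-256: 125.0 Mbps
# SM3: 125.0 Mbps
```

Under this model the four results match the reported 125, 228.6, 125, and 125 Mbps, which suggests the quoted throughputs count only the round computations.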