
Computer Engineering ›› 2026, Vol. 52 ›› Issue (4): 140-162. doi: 10.19678/j.issn.1000-3428.0070166

• Computational Intelligence and Pattern Recognition •

Deep Reinforcement Learning Algorithms for Heterogeneous Multiple Knapsack Problems

LI Bin1,2,*, GUO Yi2,3

  1. School of Mechanical and Automotive Engineering, Fujian University of Technology, Fuzhou 350118, Fujian, China
    2. Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fujian University of Technology, Fuzhou 350118, Fujian, China
    3. School of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, Fujian, China
  • Received:2024-07-23 Revised:2024-09-13 Online:2026-04-15 Published:2024-11-13
  • Contact: LI Bin

  • About the authors:

    LI Bin (CCF senior member), male, professor and postdoctoral researcher; his main research interests include machine learning, swarm intelligence, and smart ports and shipping

    GUO Yi, master's student

  • Funding:
    Humanities and Social Sciences Research Planning Fund of the Ministry of Education (19YJA630031)

Abstract:

Motivated by the traditional multiple Knapsack Problem (KP) arising in typical logistics system operations, this study abstracts a Heterogeneous Multiple Knapsack Problem (HMKP) and develops an improved Deep Deterministic Policy Gradient (DDPG) algorithm to solve it. The standard DDPG algorithm tends to fall into local optima when solving the 0-1 KP. To address this issue, a Dynamic Randomization Mechanism (DRM) and a Dynamic Penalty Mechanism (DPM) are adopted, an improved Transformer module is embedded, and a tabu list is added to prevent repeated searches, yielding the dynamic DDPG algorithm based on the improved Transformer module (TDP-DDPG). The TDP-DDPG algorithm demonstrates efficient search capability across multiple benchmark instances: it finds the optimal value for all 39 classical instances in test sets 1 and 2 (spanning low to high dimensionality) and in the higher-dimensional test set 3, and for three of the six instances in the large-scale test set 4. Experiments show that the TDP-DDPG algorithm achieves stronger optimization capability after incorporating the improved strategies. On this basis, the BPD-DDPG algorithm is designed on top of TDP-DDPG to solve the more complex HMKP, and is evaluated on high-dimensional instances constructed by combining several classical 0-1 KP instances. The results show that, compared with the commercial solver Gurobi, the BPD-DDPG algorithm requires longer solution times but achieves higher solution accuracy on three low-scale instances. The BPD-DDPG algorithm can efficiently solve high-dimensional, large-scale HMKPs at low computational cost within an acceptable time.
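Two of the abstract's ingredients, a penalty for capacity violations and a tabu list that blocks repeated searches, can be illustrated in isolation. The sketch below is not the paper's TDP-DDPG implementation; the function names, the linear penalty form, and the set-based tabu list are illustrative assumptions, showing only how a penalized 0-1 KP reward and a repeat-search filter might look.

```python
import numpy as np

def knapsack_reward(selection, values, weights, capacity, penalty_coef):
    """Reward for a candidate 0-1 KP solution: total value of the
    selected items, minus a penalty proportional to any capacity
    overload. Raising penalty_coef over training would make the
    penalty dynamic, in the spirit of a dynamic penalty mechanism."""
    total_value = float(np.dot(selection, values))
    overload = max(0.0, float(np.dot(selection, weights)) - capacity)
    return total_value - penalty_coef * overload

def accept_candidate(selection, tabu_list):
    """Reject solutions already visited, mirroring a tabu list that
    prevents repeated searches; new solutions are recorded."""
    key = tuple(selection)
    if key in tabu_list:
        return False
    tabu_list.add(key)
    return True

# Tiny 3-item instance: values, weights, capacity 8.
values = np.array([10.0, 7.0, 4.0])
weights = np.array([5.0, 4.0, 3.0])
tabu = set()
sel = np.array([1, 1, 0])          # pick items 0 and 1 (weight 9 > 8)
if accept_candidate(sel, tabu):
    r = knapsack_reward(sel, values, weights, capacity=8.0, penalty_coef=2.0)
    # value 17, overload 1, penalty 2 -> reward 15.0
```

In a DDPG-style loop the reward would feed the critic, and the tabu check would filter the decoded binary actions before evaluation.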

Key words: Deep Reinforcement Learning (DRL), 0-1 Knapsack Problem (KP), Heterogeneous Multiple Knapsack Problem (HMKP), Transformer module, dynamic penalty mechanism, tabu list
