Online 3D Bin Packing Model Based on Hierarchical Reinforcement Learning

doi:10.19678/j.issn.1000-3428.0069109

Abstract

How artificial intelligence can represent perception and action planning hierarchically, across multiple levels of abstraction and multiple timescales, has become an active research topic. Owing to technical limitations, most existing work still decomposes tasks manually. In the three-dimensional bin packing problem (3D-BPP), for example, heuristic rules guide a neural network in resolving packing points, helping the agent decompose the state space: the originally vast and complex space is converted into a set of subspaces, which gives the network better candidate solutions. This approach is, however, bounded by the rules themselves. If the rules cannot decompose the problem perfectly, their fixed assistance limits the network's performance, because better solutions are overlooked by the rules. To address this, this paper proposes an improved Packing Configuration Tree (PCT) model based on a heuristic rule fusion strategy. Following the idea of hierarchical reinforcement learning, the problem is split into layers, and a graph attention classification model is introduced to decide which spatial point expansion scheme is best in the current situation. This provides more ways to combine and arrange the decomposition of the bin's internal space points and the search for feasible positions. Experimental results show that the improved model outperforms the original model on multiple datasets. On datasets containing additional density information, the average packing utilization reaches 77.2%, 1.7 percentage points higher than the original model, and the model produces better solutions within a reasonable amount of time.

Keywords: hierarchical reinforcement learning, 3D bin packing, graph attention network (GAT), heuristic space expansion, deep reinforcement learning
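The two-level decision described in the abstract can be illustrated with a minimal Python sketch: a high-level step picks a spatial point expansion scheme, and a low-level step picks a feasible position among the points that scheme produces. Everything here is an assumption made for illustration and is not the paper's implementation. The scheme names `floor` and `corner`, the rule in `choose_scheme` (a hand-written stand-in for the learned graph attention classifier), the lowest-position placement rule, and the definition of utilization as packed volume divided by bin volume are all hypothetical. Item rotation and support (gravity) checks are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Item = Tuple[int, int, int]                 # width, depth, height
Point = Tuple[int, int, int]                # candidate position x, y, z
Box = Tuple[int, int, int, int, int, int]   # placed item: x, y, z, w, d, h


@dataclass(frozen=True)
class Bin:
    w: int
    d: int
    h: int


def corner_points(placed: List[Box]) -> List[Point]:
    """Hypothetical scheme: the origin plus three corners next to each placed box."""
    pts = {(0, 0, 0)}
    for x, y, z, w, d, h in placed:
        pts.update({(x + w, y, z), (x, y + d, z), (x, y, z + h)})
    return sorted(pts)


def floor_points(placed: List[Box]) -> List[Point]:
    """Hypothetical scheme: only the corner points that lie on the bin floor."""
    return [p for p in corner_points(placed) if p[2] == 0]


# Insertion order matters: ties in choose_scheme go to the first entry.
SCHEMES = {"floor": floor_points, "corner": corner_points}


def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned boxes overlap when their extents intersect on all three axes."""
    return all(a[i] < b[i] + b[i + 3] and b[i] < a[i] + a[i + 3] for i in range(3))


def feasible(pt: Point, item: Item, placed: List[Box], bin_: Bin) -> bool:
    """Inside the bin and not intersecting any placed box (no support check)."""
    (x, y, z), (w, d, h) = pt, item
    if x + w > bin_.w or y + d > bin_.d or z + h > bin_.h:
        return False
    return not any(overlaps((x, y, z, w, d, h), p) for p in placed)


def choose_scheme(item: Item, placed: List[Box], bin_: Bin) -> Optional[str]:
    """High level. Stand-in rule (NOT the learned classifier): prefer the scheme
    with the fewest feasible candidates, as long as it has at least one."""
    sizes = {}
    for name, expand in SCHEMES.items():
        n = sum(feasible(p, item, placed, bin_) for p in expand(placed))
        if n > 0:
            sizes[name] = n
    return min(sizes, key=sizes.get) if sizes else None


def place(item: Item, placed: List[Box], bin_: Bin) -> bool:
    """Low level. Put the item at the lowest feasible point of the chosen scheme."""
    scheme = choose_scheme(item, placed, bin_)
    if scheme is None:
        return False
    cands = [p for p in SCHEMES[scheme](placed) if feasible(p, item, placed, bin_)]
    x, y, z = min(cands, key=lambda p: (p[2], p[1], p[0]))
    placed.append((x, y, z, *item))
    return True


def utilization(placed: List[Box], bin_: Bin) -> float:
    """Assumed metric: packed volume divided by bin volume."""
    return sum(w * d * h for *_, w, d, h in placed) / (bin_.w * bin_.d * bin_.h)
```

To try it, create a `Bin`, call `place` once for each arriving item, and read `utilization` at the end. The point of the sketch is only that the set of candidate points differs by scheme, so the high-level choice changes which positions the low-level step can reach.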

QI Mingkai, WANG Di, ZHANG Liye. Online 3D Bin Packing Model Based on Hierarchical Reinforcement Learning[J]. Computer Engineering, 2025, 51(6): 136-145.

https://www.ecice06.com/CN/Y2025/V51/I6/136

Figures/Tables: 8

Fig.1 Hierarchical model structure

Fig.2 CHRL model structure

Fig.3 Visualization results under different datasets

Fig.4 Model training curves

Fig.5 Effect of reward function on the training curves

