
计算机工程 (Computer Engineering)



Zero-Action Imitation Learning for Robotic Arm via Predictive-Collaborative Optimization

  • Published: 2025-09-19


Abstract: Reinforcement learning for robot control is hampered by the difficulty of designing reward functions, while imitation learning, although it sidesteps reward engineering, relies on costly expert action data. To address both problems, this work proposes a zero-action imitation learning framework for robotic arms based on Predictive-Collaborative Optimization. The method combines model predictive control (MPC) with maximum a posteriori (MAP) Bayesian correction, achieving precise arm control through multi-step action-sequence optimization while eliminating the dependence on expert action data and hand-designed rewards. At its core, the framework uses the receding-horizon (rolling) optimization of MPC to minimize multi-step state error, dynamically adjusting the action sequence and improving robustness to noise and prediction uncertainty. Within each step, MAP estimation corrects the individual action using a prior distribution and a likelihood term, improving the local quality and efficiency of action optimization. Unlike conventional methods, the framework relies only on expert states rather than expert actions: a prediction model generates target states, which avoids the difficulty of collecting expert action data while also mitigating the accumulation of prediction error. Experiments show that the method outperforms existing baselines on a range of simulated robotic-arm tasks, improving average return by about 45.8% and reducing prediction error by about 50.7%; it demonstrates higher action-execution accuracy and better adaptability to complex environments, and achieves stable control on a real robotic-arm platform, confirming its cross-platform engineering potential.
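The abstract's core idea — state-only imitation via receding-horizon optimization of a learned dynamics model, with a Gaussian action prior playing the role of the MAP correction — can be illustrated with a minimal sketch. This is not the paper's implementation: the random-shooting optimizer, the function names (`mpc_state_imitation`, `dynamics`), and all parameter values are illustrative assumptions; the paper's own single-step MAP update and prediction model are not specified in the abstract.

```python
import numpy as np

def mpc_state_imitation(s0, expert_states, dynamics, horizon=5,
                        n_samples=256, action_dim=2,
                        prior_mean=None, prior_std=1.0,
                        prior_weight=0.1, seed=0):
    """Random-shooting MPC over a learned one-step model `dynamics(s, a)`.

    Picks the first action of the candidate sequence whose predicted
    rollout best tracks the expert *state* trajectory (no expert actions
    needed). A Gaussian prior over actions adds a MAP-style penalty that
    regularizes each step of the sequence.
    """
    rng = np.random.default_rng(seed)
    if prior_mean is None:
        prior_mean = np.zeros(action_dim)
    # Sample candidate action sequences around the prior mean.
    cand = rng.normal(prior_mean, prior_std,
                      size=(n_samples, horizon, action_dim))
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        s = s0
        for t in range(horizon):
            s = dynamics(s, cand[i, t])                    # predicted next state
            costs[i] += np.sum((s - expert_states[t])**2)  # multi-step state error
            # MAP-style term: negative log of a Gaussian prior on the action
            costs[i] += prior_weight * np.sum((cand[i, t] - prior_mean)**2)
    best = int(np.argmin(costs))
    return cand[best, 0]  # receding horizon: execute only the first action

# Toy usage: integrator dynamics, expert states drifting toward (1, 1).
def dyn(s, a):
    return s + a

expert = [np.array([0.2, 0.2]) * (t + 1) for t in range(5)]
a0 = mpc_state_imitation(np.zeros(2), expert, dyn)
```

At each control step the whole procedure is rerun from the newly observed state, which is how the rolling optimization limits the accumulation of model prediction error described in the abstract.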