[1] HU Bo,WANG Qiyao,FENG Hui,et al.Adaptive sensor scheduling algorithm for target tracking in wireless sensor networks[J].Journal of Electronics and Information Technology,2018,40(9):2033-2041.(in Chinese)胡波,王祺尧,冯辉,等.一种无线传感器网络中目标跟踪的自适应节点调度算法[J].电子与信息学报,2018,40(9):2033-2041.
[2] TESAURO G.TD-Gammon,a self-teaching backgammon program,achieves master-level play[J].Neural Computation,1994,6(2):215-219.
[3] LIU Feng,WANG Chongjun,LUO Bin.A probability-based value iteration on optimal policy algorithm for POMDP[J].Acta Electronica Sinica,2016,44(5):1078-1084.(in Chinese)刘峰,王崇骏,骆斌.一种基于最优策略概率分布的POMDP值迭代算法[J].电子学报,2016,44(5):1078-1084.
[4] SILVER D,VENESS J.Monte-Carlo planning in large POMDPs[C]//Proceedings of the 23rd International Conference on Neural Information Processing Systems.Cambridge,USA:MIT Press,2010:2164-2172.
[5] LITTMAN M L,CASSANDRA A R,KAELBLING L P.Learning policies for partially observable environments:scaling up[C]//Proceedings of the 12th International Conference on Machine Learning.San Francisco,USA:Morgan Kaufmann,1995:362-370.
[6] HAN Bing.The design and implementation of point-based POMDP policy iteration algorithm[D].Nanjing:Nanjing University,2014.(in Chinese)韩冰.基于点的POMDP策略迭代算法设计与实现[D].南京:南京大学,2014.
[7] LIU Yunlong,LI Renhou,LIU Jianshu.Q-learning algorithm based on predictive state representations[J].Journal of Xi'an Jiaotong University,2008,42(12):1472-1475.(in Chinese)刘云龙,李人厚,刘建书.基于预测状态表示的Q学习算法[J].西安交通大学学报,2008,42(12):1472-1475.
[8] LIU Quan,ZHAI Jianwei,ZHANG Zongzhang.A survey on deep reinforcement learning[J].Chinese Journal of Computers,2018,41(1):3-29.(in Chinese)刘全,翟建伟,章宗长.深度强化学习综述[J].计算机学报,2018,41(1):3-29.
[9] KARKUS P,HSU D,LEE W S.QMDP-Net:deep learning for planning under partial observability[EB/OL].[2019-11-04].https://arxiv.org/abs/1703.06692.
[10] YU Kai,JIA Lei,CHEN Yuqiang,et al.Deep learning:yesterday,today,and tomorrow[J].Journal of Computer Research and Development,2013,50(9):1799-1804.(in Chinese)余凯,贾磊,陈雨强,等.深度学习的昨天、今天和明天[J].计算机研究与发展,2013,50(9):1799-1804.
[11] HAARNOJA T,AJAY A,LEVINE S,et al.Backprop KF:learning discriminative deterministic state estimators[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Red Hook,USA:Curran Associates,2016:4376-4384.
[12] KIM W,LEE H,KIM H J.Predictive modeling of time-varying environmental information for path planning[C]//Proceedings of IEEE International Conference on Systems,Man,and Cybernetics.Washington D.C.,USA:IEEE Press,2013:3639-3644.
[13] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518(7540):529-533.
[14] TAMAR A,WU Y,THOMAS G,et al.Value iteration networks[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Red Hook,USA:Curran Associates,2016:2154-2162.
[15] SHANI G,PINEAU J,KAPLOW R.A survey of point-based POMDP solvers[J].Autonomous Agents and Multi-Agent Systems,2013,27(1):1-51.
[16] SONDIK E J.The optimal control of partially observable Markov processes over the infinite horizon:discounted costs[J].Operations Research,1978,26(2):282-304.
[17] MURPHY K P.A survey of POMDP solution techniques[EB/OL].[2019-11-04].https://www.researchgate.net/publication/2275247_A_survey_of_POMDP_solution_techniques.
[18] KOUTNÍK J,GREFF K,GOMEZ F,et al.A clockwork RNN[EB/OL].[2019-11-04].https://arxiv.org/abs/1402.3511.
[19] PASCANU R,MIKOLOV T,BENGIO Y.On the difficulty of training recurrent neural networks[C]//Proceedings of the 30th International Conference on Machine Learning.Atlanta,USA:JMLR.org,2013:1310-1318.
[20] CHO K,VAN MERRIENBOER B,GULCEHRE C,et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.Stroudsburg,USA:Association for Computational Linguistics,2014:1724-1734.
[21] WERBOS P J.Backpropagation through time:what it does and how to do it[J].Proceedings of the IEEE,1990,78(10):1550-1560.