Computer Engineering (计算机工程), 2022, Vol. 48, Issue (12): 296-303, 311. doi: 10.19678/j.issn.1000-3428.0063438

• Development Research and Engineering Application •

Reinforcement Learning Online Car-Hailing Order Dispatch Based on Joint Q-value Decomposition

HUANG Xiaohui, ZHANG Xiong, YANG Kaiming, XIONG Liyan

  1. School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
  • Received: 2021-12-02; Revised: 2022-01-13; Published: 2022-12-07
  • About the authors: HUANG Xiaohui (born 1984), male, associate professor, Ph.D.; his main research interests include deep learning and intelligent transportation. ZHANG Xiong and YANG Kaiming are master's degree candidates. XIONG Liyan is a professor.
  • Funding: National Natural Science Foundation of China (62062033, 62067002, 61967006); Key Program for Young Scholars of the Jiangxi Provincial Natural Science Foundation (20192ACBL21006); General Program of the Jiangxi Provincial Natural Science Foundation (20212BAB202008).


Abstract: Resource utilization and travel efficiency are often reduced by the unreasonable dispatch of online car-hailing orders. Based on a joint Q-value function decomposition framework, two order dispatch methods, ODDRL and LF-ODDRL, are proposed to dispatch user order requests to appropriate online car-hailing drivers efficiently and to minimize passenger waiting time. To capture the dynamically changing relationship between random demand and supply in the online car-hailing order dispatch scenario, the city is defined as a quadrilateral grid map and each vehicle is treated as an independent agent. A multi-agent Markov Decision Process (MDP) model is built, and the agents are trained by maximizing entropy together with the cumulative reward. The joint Q-value function of the multiple agents is transformed into an easily decomposable function, so that the actions selected by the joint Q-value function are consistent with those selected by each individual agent's value function. In addition, an action search function is designed that combines the advantages of centralized training with decentralized execution, allowing each vehicle to solve the order-matching problem in a distributed manner without coordinating with other vehicles, thereby reducing complexity. The experimental results demonstrate that the proposed ODDRL and LF-ODDRL scale better than methods such as Random, Greedy, and QMIX. On a 500×500 grid with 10 passengers and 2 vehicles, the total passenger pick-up time is shortened by 5% and 12%, respectively, compared with QMIX.
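The abstract states the key structural property, that the greedy action of the joint Q-value function must agree with each agent's local greedy action, but does not give the concrete ODDRL/LF-ODDRL architecture. The following is a minimal PyTorch sketch of that idea only, using an additive (VDN-style) mixer as one simple decomposition with the required monotonicity; the class names, layer sizes, and the additive mixer itself are illustrative assumptions rather than the paper's design, and the maximum-entropy training term is omitted.

import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    # Per-vehicle Q-network: maps a local observation to Q-values over
    # candidate order-matching actions. (Hypothetical architecture.)
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class AdditiveMixer(nn.Module):
    # VDN-style mixer: the joint Q is the sum of the per-agent chosen
    # Q-values. Summation is monotone in each argument, so maximizing the
    # joint Q is equivalent to each agent taking its own local argmax --
    # the joint/individual action consistency the abstract describes.
    def forward(self, chosen_qs):
        return chosen_qs.sum(dim=1)  # (batch, n_agents) -> (batch,)

# Toy usage: 2 vehicles, 5 candidate actions each.
n_agents, obs_dim, n_actions, batch = 2, 8, 5, 4
agents = [AgentQNet(obs_dim, n_actions) for _ in range(n_agents)]
mixer = AdditiveMixer()

obs = torch.randn(batch, n_agents, obs_dim)  # local observations per vehicle
qs = torch.stack([agents[i](obs[:, i]) for i in range(n_agents)], dim=1)

# Decentralized execution: each vehicle picks its own greedy action,
# with no coordination between vehicles.
actions = qs.argmax(dim=-1)  # (batch, n_agents)
chosen_qs = qs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
q_joint = mixer(chosen_qs)  # used only during centralized training

Because the mixer is monotone in every per-agent Q-value, training can be centralized on q_joint while execution stays fully decentralized: at dispatch time each vehicle only evaluates its own network.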

Key words: multi-agent, reinforcement learning, value function, order dispatch, neural network
