
Computer Engineering, 2025, Vol. 51, Issue (9): 328-339. doi: 10.19678/j.issn.1000-3428.0069559

• Development Research and Engineering Application •

Real-time Optimization of Instant Meal Delivery Based on Deep Reinforcement Learning

CHEN Yanru*, LIU Keliang, RAN Maoliang

  1. School of Economics and Management, Southwest Jiaotong University, Chengdu 610031, Sichuan, China
  • Received: 2024-03-13  Revised: 2024-05-19  Online: 2025-09-15  Published: 2025-09-26
  • Contact: CHEN Yanru
  • Supported by: National Natural Science Foundation of China (72371206)


Abstract:

To address the challenges of tight courier capacity and high rates of delayed deliveries during peak dining periods, a real-time optimization policy based on Deep Reinforcement Learning (DRL) is proposed for instant meal delivery, with the goal of improving the long-term customer service level of delivery platforms. First, accounting for the constraints of meal preparation time, pickup-and-delivery sequence, and time windows, the instant meal delivery problem with stochastic requests is modeled as a Markov Decision Process (MDP) whose objective is to maximize the expected average customer service level. Second, the Proximal Policy Optimization (PPO) algorithm is combined with an Insertion Heuristic (IH) to design an instant meal delivery optimization policy, PPO-IH. PPO-IH employs a selection policy network with a fused attention mechanism to match orders to couriers, trains the network with the PPO algorithm, and updates courier routes with the IH. Finally, in comparative experiments against a greedy policy, a minimum-difference policy, an allocation heuristic, and two other DRL algorithms, PPO-IH performs better on 71.5%, 95.5%, 87.5%, 79.5%, and 70.0% of the test periods, respectively, while achieving a higher average customer service level, shorter average delivery time per order, and a lower delayed-delivery rate. Furthermore, PPO-IH demonstrates effectiveness and generalization across scenarios with different numbers of couriers, order densities, and order time windows.
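To make the decision loop concrete, below is a minimal sketch (in Python with PyTorch) of one PPO-IH decision epoch as the abstract describes it: an attention-based policy network scores candidate couriers for a newly arrived order, a courier is sampled from the resulting distribution, and the order's pickup and delivery stops are inserted into that courier's route at the positions that add the least travel time. All names and dimensions here (OrderCourierAttention, cheapest_insertion, ORDER_DIM, COURIER_DIM, HIDDEN) are illustrative assumptions, not the paper's implementation; real state features, feasibility checks (time windows, meal preparation times), and the PPO training loop are omitted.

```python
import math

import torch
import torch.nn as nn

ORDER_DIM, COURIER_DIM, HIDDEN = 8, 8, 64  # illustrative feature sizes


class OrderCourierAttention(nn.Module):
    """Score candidate couriers for an incoming order with dot-product attention."""

    def __init__(self):
        super().__init__()
        self.q = nn.Linear(ORDER_DIM, HIDDEN)    # query from order features
        self.k = nn.Linear(COURIER_DIM, HIDDEN)  # keys from courier features

    def forward(self, order, couriers):
        # order: (ORDER_DIM,), couriers: (n_couriers, COURIER_DIM)
        scores = self.k(couriers) @ self.q(order) / math.sqrt(HIDDEN)
        # PPO would train on dist.log_prob(action) via its clipped surrogate objective
        return torch.distributions.Categorical(logits=scores)


def cheapest_insertion(route, pickup, delivery, travel_time):
    """Insert pickup, then delivery after it, where total travel time grows least."""
    best_cost, best_route = float("inf"), route
    for i in range(len(route) + 1):                   # candidate pickup position
        with_pickup = route[:i] + [pickup] + route[i:]
        for j in range(i + 1, len(with_pickup) + 1):  # delivery strictly after pickup
            cand = with_pickup[:j] + [delivery] + with_pickup[j:]
            cost = sum(travel_time(a, b) for a, b in zip(cand, cand[1:]))
            if cost < best_cost:
                best_cost, best_route = cost, cand
    return best_route


def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


# One decision epoch: a new order arrives, the policy picks a courier,
# and only that courier's route is repaired by cheapest insertion.
policy = OrderCourierAttention()
order_feat = torch.randn(ORDER_DIM)          # e.g. locations, time window, prep time
courier_feats = torch.randn(5, COURIER_DIM)  # e.g. position, load, route slack
dist = policy(order_feat, courier_feats)
courier = int(dist.sample())

routes = {c: [] for c in range(5)}           # current planned stops per courier
pickup, delivery = (2.0, 3.0), (4.0, 1.0)    # restaurant and customer coordinates
routes[courier] = cheapest_insertion(routes[courier], pickup, delivery, euclid)
print(f"order assigned to courier {courier}, route: {routes[courier]}")
```

Two design points this sketch illustrates: dot-product attention keeps the policy independent of the number of candidate couriers, and repairing only the chosen courier's route by insertion keeps the per-decision cost low enough for real-time dispatching.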

Key words: meal delivery, real-time optimization, Deep Reinforcement Learning (DRL), Markov Decision Process (MDP), Proximal Policy Optimization (PPO), attention mechanism