
Computer Engineering, 2025, Vol. 51, Issue (9): 328-339. doi: 10.19678/j.issn.1000-3428.0069559

• Development Research and Engineering Application •

Real-time Optimization of Instant Meal Delivery Based on Deep Reinforcement Learning

CHEN Yanru*, LIU Keliang, RAN Maoliang

  1. School of Economics and Management, Southwest Jiaotong University, Chengdu 610031, Sichuan, China
  • Received: 2024-03-13  Revised: 2024-05-19  Online: 2025-09-15  Published: 2025-09-26
  • Contact: CHEN Yanru
  • Supported by: National Natural Science Foundation of China (72371206)


Abstract:

To address the challenges of tight courier capacity and high rates of delayed deliveries during peak dining periods, a real-time optimization policy based on Deep Reinforcement Learning (DRL) is proposed for instant meal delivery, with the goal of improving the long-term customer service level of delivery platforms. First, accounting for the constraints of meal preparation time, pickup-and-delivery sequence, and time windows, the instant meal delivery problem with stochastic requests is modeled as a Markov Decision Process (MDP) whose objective is to maximize the expected average customer service level. Second, the Proximal Policy Optimization (PPO) algorithm is combined with an Insertion Heuristic (IH) to design an instant meal delivery optimization policy, PPO-IH. PPO-IH employs a selection policy network with a fused attention mechanism to match orders to couriers, trains the network with the PPO algorithm, and updates courier routes with the IH. Finally, in comparative experiments against a greedy policy, a minimum-difference policy, an allocation heuristic, and two other DRL algorithms, PPO-IH performs better on 71.5%, 95.5%, 87.5%, 79.5%, and 70.0% of the test periods, respectively, while achieving a higher average customer service level, shorter average delivery time per order, and a lower delayed-delivery rate. Furthermore, PPO-IH demonstrates effectiveness and generalization across scenarios with different numbers of couriers, order densities, and order time windows.
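To make the decision loop concrete, below is a minimal sketch (in Python with PyTorch) of one PPO-IH decision epoch as the abstract describes it: an attention-based policy network scores candidate couriers for a newly arrived order, a courier is sampled from the resulting distribution, and the order's pickup and delivery stops are inserted into that courier's route at the positions that add the least travel time. All names and dimensions here (OrderCourierAttention, cheapest_insertion, ORDER_DIM, COURIER_DIM, HIDDEN) are illustrative assumptions, not the paper's implementation; real state features, feasibility checks (time windows, meal preparation times), and the PPO training loop are omitted.

```python
import math

import torch
import torch.nn as nn

ORDER_DIM, COURIER_DIM, HIDDEN = 8, 8, 64  # illustrative feature sizes


class OrderCourierAttention(nn.Module):
    """Score candidate couriers for an incoming order with dot-product attention."""

    def __init__(self):
        super().__init__()
        self.q = nn.Linear(ORDER_DIM, HIDDEN)    # query from order features
        self.k = nn.Linear(COURIER_DIM, HIDDEN)  # keys from courier features

    def forward(self, order, couriers):
        # order: (ORDER_DIM,), couriers: (n_couriers, COURIER_DIM)
        scores = self.k(couriers) @ self.q(order) / math.sqrt(HIDDEN)
        # PPO would train on dist.log_prob(action) via its clipped surrogate objective
        return torch.distributions.Categorical(logits=scores)


def cheapest_insertion(route, pickup, delivery, travel_time):
    """Insert pickup, then delivery after it, where total travel time grows least."""
    best_cost, best_route = float("inf"), route
    for i in range(len(route) + 1):                   # candidate pickup position
        with_pickup = route[:i] + [pickup] + route[i:]
        for j in range(i + 1, len(with_pickup) + 1):  # delivery strictly after pickup
            cand = with_pickup[:j] + [delivery] + with_pickup[j:]
            cost = sum(travel_time(a, b) for a, b in zip(cand, cand[1:]))
            if cost < best_cost:
                best_cost, best_route = cost, cand
    return best_route


def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


# One decision epoch: a new order arrives, the policy picks a courier,
# and only that courier's route is repaired by cheapest insertion.
policy = OrderCourierAttention()
order_feat = torch.randn(ORDER_DIM)          # e.g. locations, time window, prep time
courier_feats = torch.randn(5, COURIER_DIM)  # e.g. position, load, route slack
dist = policy(order_feat, courier_feats)
courier = int(dist.sample())

routes = {c: [] for c in range(5)}           # current planned stops per courier
pickup, delivery = (2.0, 3.0), (4.0, 1.0)    # restaurant and customer coordinates
routes[courier] = cheapest_insertion(routes[courier], pickup, delivery, euclid)
print(f"order assigned to courier {courier}, route: {routes[courier]}")
```

Two design points this sketch illustrates: dot-product attention keeps the policy independent of the number of candidate couriers, and repairing only the chosen courier's route by insertion keeps the per-decision cost low enough for real-time dispatching.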

Key words: meal delivery, real-time optimization, Deep Reinforcement Learning (DRL), Markov Decision Process (MDP), Proximal Policy Optimization (PPO), attention mechanism