Pose Estimation Algorithm for Small Target Pedestrians in Urban Street View Based on YOLO-Pose

doi:10.19678/j.issn.1000-3428.0067733

Abstract:

Existing pose estimation algorithms perform poorly on small-target pedestrians in urban street scenes. To address this problem, this study proposes YOLO-Pose-CBAM, a small-target pedestrian pose estimation algorithm based on YOLO-Pose. First, the CBAM attention module is introduced to strengthen the network's focus on small-target pedestrian regions and to improve the algorithm's sensitivity to small-target pedestrians without adding excessive computation. At the same time, four detection heads of different sizes are used in the backbone network, giving the algorithm more ways to detect pedestrians of different sizes in an image. Second, two cross-layer cascade channels are built between the backbone and the neck, which improves feature fusion between the shallow and deep layers, further strengthens information exchange, and reduces the miss rate for small-target pedestrians. Third, SIoU is introduced to redefine the localization loss for bounding-box regression, which speeds up training convergence and improves detection accuracy. Finally, the k-means++ algorithm replaces the k-means algorithm for clustering the annotated boxes in the dataset, avoiding the local optima caused by cluster-center initialization and yielding anchor boxes better suited to detecting small-target pedestrians. Comparative experiments show that, on the small-target pedestrian dataset WiderKeypoints, the proposed algorithm improves Average Precision (AP) by 4.6 percentage points over YOLO-Pose and by 6.5 percentage points over YOLOv7-Pose.

Key words: YOLO-Pose algorithm, pose estimation, cross-layer cascade, CBAM attention mechanism, SIoU loss function, k-means++ algorithm
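The anchor-clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's distance metric or update rule, so the choices here are assumptions: distance is taken as 1 - IoU between (width, height) pairs, as is common in YOLO-style anchor clustering, and cluster centers are updated with the mean. The function names `iou_wh`, `kmeanspp_seed`, and `cluster_anchors` are hypothetical.

```python
import random


def iou_wh(box, anchor):
    """IoU of two (width, height) boxes aligned at a common corner; sizes must be positive."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeanspp_seed(boxes, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        # Assumed distance: 1 - IoU to the nearest existing center.
        d2 = [min(1.0 - iou_wh(b, c) for c in centers) ** 2 for b in boxes]
        total = sum(d2)
        if total == 0.0:  # every box coincides with an existing center
            centers.append(rng.choice(boxes))
            continue
        r = rng.uniform(0.0, total)
        acc = 0.0
        for b, w in zip(boxes, d2):
            acc += w
            if w > 0 and acc >= r:
                centers.append(b)
                break
    return centers


def cluster_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = random.Random(seed)
    centers = kmeanspp_seed(boxes, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            # Assign each box to the center it overlaps most.
            best = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            groups[best].append(b)
        new_centers = []
        for g, c in zip(groups, centers):
            if g:  # mean update (an assumption; medians are also used in practice)
                new_centers.append((sum(w for w, _ in g) / len(g),
                                    sum(h for _, h in g) / len(g)))
            else:  # keep the old center if its cluster is empty
                new_centers.append(c)
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Made-up box sizes in pixels, for illustration only.
boxes = [(8, 20), (10, 24), (9, 22), (30, 70), (32, 76), (28, 68), (90, 200), (96, 210)]
anchors = cluster_anchors(boxes, k=3)
```

Compared with uniformly random initialization, this seeding spreads the initial centers across the size distribution, which makes it less likely that several centers start inside one dense group of boxes.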

Mingxu MA, Hong MA, Huawei SONG. Pose Estimation Algorithm for Small Target Pedestrians in Urban Street View Based on YOLO-Pose[J]. Computer Engineering, 2024, 50(4): 177-186.

http://www.ecice06.com/CN/Y2024/V50/I4/177

Figures/Tables (13)

Fig.1 YOLO-Pose network structure

Fig.2 Improved YOLO-Pose-CBAM network structure

Fig.3 CBAM attention mechanism

Fig.4 Channel attention module

Fig.5 Spatial attention module

Fig.6 Cross-layer cascade feature fusion structure

Fig.7 Comparison of loss function curves

Fig.8 Comparison of detection results on the test set

Fig.9 Comparison of pedestrian detection results in street views
