Research on Dense Pedestrian Detection Algorithm Based on Improved DETR

Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 250-258. doi: 10.19678/j.issn.1000-3428.0070106

• Computer Vision and Graphics/Image Processing •

SONG Tianze¹,², CAO Congjun¹,²,*, HE Jiaqi¹,², WANG Xusheng¹,², LIU Chenyu¹,²

1. Faculty of Printing, Packaging Engineering and Digital Media Technology, Xi'an University of Technology, Xi'an 710054, Shaanxi, China
2. Printing and Packaging Engineering Technology Research Center of Shaanxi Province, Xi'an 710054, Shaanxi, China

Received: 2024-07-11 Revised: 2024-11-08 Online: 2026-05-15 Published: 2024-12-19
Contact: CAO Congjun
About the authors:
SONG Tianze (CCF student member), male, master's student; main research interests: object detection and image captioning
CAO Congjun (corresponding author), professor, Ph.D.
HE Jiaqi, master's student
WANG Xusheng, associate professor, Ph.D.
LIU Chenyu, master's student
Supported by:
Key Scientific Research Base Project of Shaanxi Province (2023HBGC-18)

Abstract

Dense pedestrian detection is an active research topic in pedestrian detection. To address the missed detection of occluded and small pedestrians in dense scenes, this study proposes Pe-DETR, an object detection algorithm based on an improved DETR. Dino-DETR, which is built on the multi-head self-attention mechanism, serves as the baseline model. Because self-attention is weak at capturing local features, which degrades detection in dense pedestrian scenes, the Feedforward Neural Network (FNN) is redesigned as DWSEFNN, a channel attention depthwise convolutional feedforward neural network that lets the model extract more local detail. Because the ResNet50 backbone extracts important features inefficiently, Swin Transformer-L is adopted as the feature extraction network, which strengthens the backbone's extraction of important features; Pe-DETR is thereby built entirely on attention mechanisms, and its architecture contains no deep convolutional structure. To resolve the conflict between the large number of targets in dense pedestrian scenes and the sparse matching used by DETR detectors, dense distinct queries are applied, which handle crowded scenes without introducing invalid, near-identical queries. Experimental results on the CrowdHuman dense pedestrian detection dataset show that, compared with Dino-DETR, Pe-DETR improves Average Precision (AP)@0.5 by 3.7 percentage points and AP by 4.5 percentage points, and that its accuracy on dense pedestrian detection is clearly higher than that of other end-to-end models.

Key words: pedestrian detection, object detection, depthwise convolution, transfer learning, self-attention mechanism

SONG Tianze, CAO Congjun, HE Jiaqi, WANG Xusheng, LIU Chenyu. Research on Dense Pedestrian Detection Algorithm Based on Improved DETR[J]. Computer Engineering, 2026, 52(5): 250-258.

https://www.ecice06.com/CN/Y2026/V52/I5/250

Figures/Tables 13

Fig.1 Structure of the Pe-DETR network

Fig.2 Structure of Swin Transformer-L

Fig.3 Process of shifted window multi-head self-attention
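
The window layout behind Fig.3 can be sketched in plain Python. This is a minimal illustration that assumes the standard Swin Transformer scheme of a cyclic shift followed by a regular partition. The function names, the 4×4 token grid, the window size of 2, and the shift of 1 are all illustrative choices, and the attention mask that Swin applies to wrapped-around windows is left out.

```python
from typing import List

Grid = List[List[int]]  # token ids laid out as [row][col]


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]


def window_partition(grid: Grid, window: int) -> List[List[int]]:
    """Split the grid into non-overlapping window x window blocks, row-major."""
    h, w = len(grid), len(grid[0])
    assert h % window == 0 and w % window == 0, "grid must divide into windows"
    windows = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            windows.append([grid[top + di][left + dj]
                            for di in range(window) for dj in range(window)])
    return windows


# Hypothetical 4x4 grid of token ids 0..15.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]
regular_windows = window_partition(tokens, window=2)
shifted_windows = window_partition(cyclic_shift(tokens, shift=1), window=2)

print(regular_windows[0])  # [0, 1, 4, 5]
print(shifted_windows[0])  # [5, 6, 9, 10]
```

The first shifted window gathers tokens 5, 6, 9, and 10, one from each regular window, which is how alternating the two layouts lets information cross window borders.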

Fig.4 Schematic diagram of the transfer learning principle

Fig.5 Structures of a common feedforward neural network and the channel attention depthwise convolutional feedforward neural network (DWSEFNN)
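
The two ingredients that the name DWSEFNN suggests, a depthwise convolution and channel attention, can be sketched in plain Python as follows. This is not the authors' implementation: the layer order, the 3×3 kernel, the squeeze-and-excitation style gate, and all function and weight names are assumptions, and the token-wise linear layers of an ordinary FFN are omitted to keep the sketch short.

```python
import math
from typing import List

Grid = List[List[List[float]]]  # feature map indexed as [row][col][channel]


def depthwise_conv3x3(x: Grid, kernels: List[List[List[float]]]) -> Grid:
    """3x3 depthwise convolution with zero padding; channel ch only uses kernels[ch]."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[ch][di + 1][dj + 1] * x[ii][jj][ch]
                out[i][j][ch] = acc
    return out


def channel_gate(x: Grid, w_reduce: List[List[float]],
                 w_expand: List[List[float]]) -> Grid:
    """Squeeze-and-excitation style gate: average pool, linear + ReLU, linear + sigmoid."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    pooled = [sum(x[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
              for ch in range(c)]
    hidden = [max(0.0, sum(row[ch] * pooled[ch] for ch in range(c)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(row[k] * hidden[k]
                                        for k in range(len(hidden)))))
             for row in w_expand]
    return [[[x[i][j][ch] * gates[ch] for ch in range(c)]
             for j in range(w)] for i in range(h)]


def local_branch(x: Grid, kernels, w_reduce, w_expand) -> Grid:
    """Hypothetical local branch: depthwise conv for locality, then channel reweighting."""
    return channel_gate(depthwise_conv3x3(x, kernels), w_reduce, w_expand)


# Toy 2x2 feature map with 2 channels; every weight below is made up.
feature_map = [[[1.0, 2.0] for _ in range(2)] for _ in range(2)]
box_kernels = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
w_reduce = [[0.5, 0.5]]    # 2 channels -> 1 hidden unit
w_expand = [[0.0], [0.0]]  # zero weights give every channel a gate of 0.5
gated = local_branch(feature_map, box_kernels, w_reduce, w_expand)
```

The depthwise step mixes each token with its spatial neighbours, which is the local information that plain self-attention does not favour, and the gate then rescales whole channels. The zero expand weights are chosen only so that the result is easy to check by hand.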

Fig.6 Schematic diagram of dense distinct queries
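
One way to picture how a dense set of candidate queries is thinned into distinct ones is class-agnostic non-maximum suppression: score many candidates, then drop any that nearly duplicates a higher-scoring one. The sketch below follows that general idea only. The function names, the IoU threshold of 0.7, the query budget of 900, and the toy boxes are assumptions for illustration, not values from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_distinct_queries(boxes: Sequence[Box], scores: Sequence[float],
                            iou_thr: float = 0.7,
                            max_queries: int = 900) -> List[int]:
    """Return indices of candidates kept after class-agnostic NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a candidate only if it does not overlap a kept one too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
            if len(kept) == max_queries:
                break
    return kept


# Toy crowd: boxes 0 and 1 cover the same person, box 2 is a partly occluded neighbour.
dense_boxes = [(10, 10, 50, 110), (12, 11, 51, 112),
               (40, 10, 80, 110), (200, 20, 240, 120)]
dense_scores = [0.90, 0.85, 0.80, 0.60]
distinct = select_distinct_queries(dense_boxes, dense_scores)
print(distinct)  # [0, 2, 3]
```

Box 1 is removed as a near-duplicate of box 0, while the overlapping neighbour in box 2 survives because its IoU with box 0 is only about 0.14. This matches the abstract's stated goal of staying dense enough for crowds without keeping near-identical queries.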

Fig.7 Example of Pe-DETR's detection results

