
Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 250-258. doi: 10.19678/j.issn.1000-3428.0070106

• Computer Vision, Graphics, and Image Processing •

Research on Dense Pedestrian Detection Algorithm Based on Improved DETR

宋天泽1,2, 曹从军1,2,*, 何佳琪1,2, 王旭升1,2, 刘晨煜1,2

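  1. Faculty of Printing, Packaging Engineering and Digital Media Technology, Xi'an University of Technology, Xi'an 710054, Shaanxi, China
  2. Printing and Packaging Engineering Technology Research Center of Shaanxi Province, Xi'an 710054, Shaanxi, China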
  • Received: 2024-07-11 Revised: 2024-11-08 Publication date: 2026-05-15 Online release date: 2024-12-19
  • Corresponding author: 曹从军 (CAO Congjun)
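  • About the authors:

    宋天泽 (SONG Tianze; CCF student member), male, master's student; main research interests: object detection and image captioning

    曹从军 (CAO Congjun; corresponding author), professor, Ph.D.

    何佳琪 (HE Jiaqi), master's student

    王旭升 (WANG Xusheng), associate professor, Ph.D.

    刘晨煜 (LIU Chenyu), master's student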

  • Funding:
    Shaanxi Province Key Scientific Research Base Project (2023HBGC-18)

Research on Dense Pedestrian Detection Algorithm Based on Improved DETR

SONG Tianze1,2, CAO Congjun1,2,*, HE Jiaqi1,2, WANG Xusheng1,2, LIU Chenyu1,2

  1. Faculty of Printing, Packaging Engineering and Digital Media Technology, Xi'an University of Technology, Xi'an 710054, Shaanxi, China
  2. Printing and Packaging Engineering Technology Research Center of Shaanxi Province, Xi'an 710054, Shaanxi, China
  • Received:2024-07-11 Revised:2024-11-08 Published:2026-05-15 Online:2024-12-19
  • Contact: CAO Congjun

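Abstract:

Dense pedestrian detection is a major research focus in pedestrian detection. To address the frequent missed detection of occluded and small pedestrians in dense scenes, an improved DETR-based object detection algorithm, Pe-DETR, is proposed. Dino-DETR, which is built on the multi-head self-attention mechanism, serves as the baseline model. Because self-attention lacks the ability to capture local features, which degrades detection in dense pedestrian scenes, the Feedforward Neural Network (FNN) is improved: a channel-attention depthwise-convolution feedforward neural network, DWSEFNN, is designed so that the model can extract more local detail features. To address the low efficiency of the ResNet50 backbone in extracting important features, Swin Transformer-L is adopted as the feature extraction network. This strengthens the backbone's ability to extract important features and also makes Pe-DETR an architecture built entirely on attention mechanisms, without deep convolutional structures. To resolve the conflict between the large number of targets in dense pedestrian scenes and the sparse matching used in DETR detectors, dense distinct queries are applied, which handle crowded scenes effectively without introducing invalid similar queries. Experimental results on the CrowdHuman dense pedestrian detection dataset show that, compared with Dino-DETR, the proposed Pe-DETR improves Average Precision (AP)@0.5 by 3.7 percentage points and AP by 4.5 percentage points. In dense pedestrian detection, the improved Pe-DETR is clearly more accurate than other end-to-end models.

Key words: pedestrian detection, object detection, depthwise convolution, transfer learning, self-attention mechanism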
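
As a rough illustration of the idea behind DWSEFNN, the sketch below places a depthwise 3x3 convolution and a channel gate between the two linear layers of a Transformer feedforward block. The abstract does not give the layer order, kernel size, activation, or reduction ratio, and the reading of "SE" as squeeze-and-excitation is inferred from the name, so all of these, along with every function and parameter name, are assumptions. It is written in plain Python so that it runs without a deep learning framework.

```python
import math
import random


def linear(tokens, weight, bias):
    """Apply y = W x + b to each token; weight is a list of output rows."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def relu(tokens):
    return [[max(0.0, v) for v in tok] for tok in tokens]


def depthwise_conv3x3(tokens, height, width, kernels):
    """Zero-padded 3x3 depthwise convolution over tokens stored row-major on a
    height x width grid. kernels[c] is the 3x3 kernel of channel c, so channels
    never mix; only spatial neighbours do."""
    channels = len(tokens[0])
    out = [[0.0] * channels for _ in tokens]
    for y in range(height):
        for x in range(width):
            dst = out[y * width + x]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        src = tokens[yy * width + xx]
                        for c in range(channels):
                            dst[c] += kernels[c][dy + 1][dx + 1] * src[c]
    return out


def squeeze_excite(tokens, w_reduce, w_expand):
    """Channel attention: average each channel over all tokens (squeeze), pass
    the result through a small bottleneck (excite), and rescale every channel
    by a sigmoid gate in (0, 1)."""
    count = len(tokens)
    channels = len(tokens[0])
    squeezed = [sum(tok[c] for tok in tokens) / count for c in range(channels)]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    return [[v * g for v, g in zip(tok, gates)] for tok in tokens]


def dwse_ffn(tokens, height, width, p):
    """Hypothetical DWSEFNN-style block: linear -> ReLU -> depthwise conv ->
    ReLU -> channel gate -> linear. The layer order is an assumption."""
    hidden = relu(linear(tokens, p["w1"], p["b1"]))
    hidden = relu(depthwise_conv3x3(hidden, height, width, p["dw"]))
    hidden = squeeze_excite(hidden, p["se_reduce"], p["se_expand"])
    return linear(hidden, p["w2"], p["b2"])


def random_params(d_model, d_hidden, reduction, rng):
    """Random weights for demonstration only (not trained values)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": mat(d_hidden, d_model), "b1": [0.0] * d_hidden,
        "dw": [mat(3, 3) for _ in range(d_hidden)],
        "se_reduce": mat(d_hidden // reduction, d_hidden),
        "se_expand": mat(d_hidden, d_hidden // reduction),
        "w2": mat(d_model, d_hidden), "b2": [0.0] * d_model,
    }


rng = random.Random(0)
height, width, d_model = 4, 5, 8
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(height * width)]
params = random_params(d_model, d_hidden=16, reduction=4, rng=rng)
output = dwse_ffn(tokens, height, width, params)
print(len(output), len(output[0]))  # 20 8: same token count and width as the input
```

The depthwise step is the only place where a token sees its spatial neighbours, which is the local information that plain self-attention followed by a per-token feedforward layer does not model explicitly. The channel gate then reweights whole feature channels using statistics pooled over all tokens.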

Abstract:

Dense pedestrian detection is a research hotspot in the field of pedestrian detection. This study proposes Pe-DETR, an improved DETR-based object detection algorithm, to address the missed detection of occluded and small pedestrians in dense scenes. The algorithm uses Dino-DETR, which is based on the multi-head self-attention mechanism, as the baseline model. Because the self-attention mechanism lacks the ability to capture local features, its detection of dense pedestrians is poor. To address this issue, the study improves the Feedforward Neural Network (FNN) and proposes DWSEFNN, a feedforward neural network with channel attention and depthwise convolution, to extract more local detail features. Because the ResNet50 backbone network extracts important features inefficiently, Swin Transformer-L is adopted as the feature extraction network, which improves the backbone's ability to extract important features. With this change, Pe-DETR is built entirely on the attention mechanism, and the architecture contains no deep convolutional structure. To reconcile the large number of targets in dense pedestrian scenes with the sparse matching used in DETR detectors, dense distinct queries are applied, which handle pedestrian-dense scenes without introducing invalid similar queries. Experimental results on the CrowdHuman dense pedestrian detection dataset show that, compared with Dino-DETR, Pe-DETR improves Average Precision (AP)@0.5 by 3.7 percentage points and AP by 4.5 percentage points. In dense pedestrian detection tasks, Pe-DETR is significantly more accurate than other end-to-end models.

Key words: pedestrian detection, object detection, depthwise convolution, transfer learning, self-attention mechanism
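
The abstract does not say how the distinct queries are chosen. One common way to keep a dense set of candidate queries while discarding near-duplicates is class-agnostic non-maximum suppression over the boxes the queries predict, and the sketch below assumes that approach. The function names, the box format `(x1, y1, x2, y2)`, and the IoU threshold of 0.7 are all illustrative assumptions, not values taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_distinct_queries(boxes, scores, iou_threshold=0.7):
    """Return indices of queries to keep, highest score first.

    A query is dropped only if its box overlaps an already kept box by more
    than iou_threshold, so near-duplicates are removed while partially
    overlapping (occluded) pedestrians survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [
    (0, 0, 10, 20),   # pedestrian A
    (1, 0, 11, 20),   # near-duplicate prediction of A
    (30, 0, 40, 20),  # pedestrian B, far from A
    (5, 0, 15, 20),   # pedestrian C, partly occluded by A
]
scores = [0.9, 0.8, 0.7, 0.6]

kept = select_distinct_queries(boxes, scores)
print(kept)  # [0, 2, 3]
```

In this toy case the second box overlaps the first with an IoU of about 0.82 and is discarded as a similar query, while the partly occluded pedestrian (IoU of about 0.33 with the first box) is kept. A lower threshold would start to suppress occluded pedestrians as well, which is the trade-off a crowded-scene detector has to manage.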