
Computer Engineering ›› 2025, Vol. 51 ›› Issue (3): 352-361. doi: 10.19678/j.issn.1000-3428.0069071

• Development Research and Engineering Application •

Crowd Counting Network Based on Attention Mechanism and Multiscale Fusion

LUAN Fangjun, GONG Qi, YUAN Shuai*

  1. School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, Liaoning, China
    2. Liaoning Province Big Data Management and Analysis Laboratory of Urban Construction, Shenyang 110168, Liaoning, China
    3. Shenyang Branch of National Special Computer Engineering Technology Research Center, Shenyang 110168, Liaoning, China
  • Received: 2023-12-21 Online: 2025-03-15 Published: 2024-05-08
  • Contact: YUAN Shuai

基于注意力机制和多尺度融合的人群计数网络

栾方军, 龚琪, 袁帅*

  1. 沈阳建筑大学计算机科学与工程学院, 辽宁 沈阳 110168
    2. 辽宁省城市建设大数据管理与分析重点实验室, 辽宁 沈阳 110168
    3. 国家特种计算机工程技术研究中心沈阳分中心, 辽宁 沈阳 110168
  • 通讯作者: 袁帅
  • Supported by:
    National Natural Science Foundation of China (62073227); Applied Basic Research Program of Liaoning Province (2023JH2/101300212)

Abstract:

To address scale variation and background interference in crowd image counting, a network model is proposed that fully exploits multiscale information and mitigates the impact of background noise. First, the model employs ConvNeXt as the backbone for feature extraction. Next, a Multilevel Feature Fusion Module (MFFM) performs cross-scale fusion of features from different layers of the backbone network; the fused features carry semantic information at multiple scales and are therefore better suited to handling scale variation in crowd counting. Furthermore, a MultiScale Attention Module (MSAM) is designed: it uses branches with different receptive fields to extract features at various scales, applies Selective Kernel Channel Attention (SKCA) to mitigate the feature similarity problem of multicolumn structures, and feeds the attention map it generates back into the features of the corresponding scale to suppress background interference. On the ShanghaiTechA dataset, the proposed model achieves a Mean Absolute Error (MAE) of 56.1 and a Root Mean Square Error (RMSE) of 93.9. On ShanghaiTechB, the MAE and RMSE are 6.1 and 10.3; on UCF_CC_50, 174.9 and 252.7; and on Mall, 1.42 and 1.85. These results on public datasets indicate that the proposed model improves both accuracy and robustness over existing representative crowd counting methods.
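
The abstract reports MAE and RMSE over per-image counts. As a minimal sketch of how such metrics are typically computed in density-map-based crowd counting (this is not the authors' evaluation code), the example below assumes the predicted count of an image is the sum of its predicted density map. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math

def count_from_density(density_map):
    """Estimated head count, assumed here to be the sum of all density values."""
    return sum(sum(row) for row in density_map)

def mae(pred_counts, true_counts):
    """Mean Absolute Error over per-image counts."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n

def rmse(pred_counts, true_counts):
    """Root Mean Square Error over per-image counts."""
    n = len(true_counts)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n)

# Toy data (hypothetical): two tiny 2x2 "density maps" and their true counts.
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],  # sums to 2.0
    [[1.0, 1.5], [0.5, 2.0]],  # sums to 5.0
]
true_counts = [3, 5]

pred_counts = [count_from_density(d) for d in density_maps]
print("MAE:", mae(pred_counts, true_counts))
print("RMSE:", rmse(pred_counts, true_counts))
```

Because RMSE squares each error before averaging, it penalizes large miscounts on individual images more heavily than MAE; this is why the two are often read as indicators of accuracy and robustness, respectively.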

Key words: crowd counting, multiscale feature fusion, attention mechanism, neural networks, density map

摘要:

为了应对人群图像中尺度变化和背景干扰的问题, 提出一种人群计数网络模型, 旨在充分利用多尺度信息并降低背景噪声的影响。首先采用ConvNeXt作为主干网络, 用于提取特征。其次为了有效融合不同层次的特征, 提出多层次特征融合模块(MFFM), 将主干网络中不同层次的特征进行跨尺度融合, 融合后的特征包含了不同尺度的语义信息, 可以更好地适应人群计数任务中的尺度变化问题。接着为了更好地解决人群计数中存在的挑战, 设计一个多尺度注意力模块(MSAM), 根据不同感受野的分支提取不同尺度的特征, 利用选择性Kernel通道注意力(SKCA)缓解多列结构存在的特征相似问题, 并将模块生成的注意力图反馈到对应的尺度特征中, 以抑制背景的干扰。网络模型在ShanghaiTechA数据集中的平均绝对误差(MAE)和均方根误差(RMSE)分别达到了56.1和93.9;在ShanghaiTechB数据集中的MAE和RMSE分别达到了6.1和10.3;在UCF_CC_50数据集中的MAE和RMSE分别达到了174.9和252.7;在Mall数据集中的MAE和RMSE分别达到了1.42和1.85。在公开数据集上的实验结果表明, 提出的网络模型与现有代表性的人群计数方法相比, 在提升人群计数任务的准确性和鲁棒性方面均取得了明显进展。

关键词: 人群计数, 多尺度特征融合, 注意力机制, 神经网络, 密度图