
Computer Engineering ›› 2021, Vol. 47 ›› Issue (6): 277-283. doi: 10.19678/j.issn.1000-3428.0057892

• Graphics and Image Processing •

Person Re-Identification in Video Based on Spatial-Temporal Attention Region

HU Xiaoqiang, WEI Dan, WANG Ziyang, SHEN Jianglin, REN Hongjuan   

  1. School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2020-03-30  Revised: 2020-05-14  Published: 2020-06-08
  • Supported by: Youth Program of the National Natural Science Foundation of China (51805312)
  • Contact: WEI Dan  E-mail: weiweidandan@163.com

  • About the authors: HU Xiaoqiang (b. 1996), male, M.S. candidate; his research interests include pattern recognition and person re-identification. WEI Dan (corresponding author), Lecturer, Ph.D. WANG Ziyang and SHEN Jianglin, M.S. candidates. REN Hongjuan, Associate Professor, Ph.D.

Abstract: When performing the person re-identification task on videos, traditional local-based methods mainly focus on learning local feature representations in regions with specific predefined semantics, so their learning efficiency and robustness are reduced in complex scenes. This paper combines global and local features to propose a person re-identification method for video based on spatial-temporal attention regions. The attention region features aggregated across frames are fused with the global feature to obtain a video-level feature representation, and the two pathways of a SlowFast network are used to extract the global feature and the attention region features respectively. In the fast path, a multiple spatial attention model extracts the attention region features, and a temporal aggregation model aggregates the attention region features of the same body part across all sampled frames. In the slow path, global features are extracted by a Convolutional Neural Network (CNN). On this basis, an affinity matrix and location parameters are used to fuse the attention region features with the global feature. The average Euclidean distance is used to evaluate the fusion loss, and the triplet loss function is used for end-to-end network training. The experimental results show that the Rank-1 accuracy of this method reaches 93.4% on the PRID 2011 dataset and its mAP reaches 79.5% on the MARS dataset, demonstrating better recognition performance than SeeForest, ASTPN, RQEN and other methods, as well as excellent robustness to changes in illumination and person pose and to occlusion.

Key words: person re-identification, attention region, temporal aggregation, global feature, feature fusion
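To make the described pipeline concrete, the following minimal PyTorch sketch illustrates the overall data flow: per-frame features from a shared backbone, K spatial attention maps that each pool one region descriptor, temporal aggregation of region features across frames, fusion with the global feature, and training with a triplet loss. All module names, dimensions, and the simplified choices (mean pooling as a stand-in for the temporal aggregation model, concatenation as a stand-in for the affinity-matrix fusion, a toy backbone instead of the SlowFast pathways) are illustrative assumptions, not the authors' implementation.

# A minimal sketch of the two-pathway design described in the abstract.
# Module names, dimensions, and the aggregation/fusion rules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSpatialAttention(nn.Module):
    """Produces K attention maps over a frame's feature map and pools
    one region descriptor per map (the fast path's local branch)."""

    def __init__(self, channels: int, num_regions: int = 4):
        super().__init__()
        self.att = nn.Conv2d(channels, num_regions, kernel_size=1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        maps = self.att(feat)                       # (B, K, H, W)
        maps = F.softmax(maps.flatten(2), dim=-1)   # normalize over H*W
        feat = feat.flatten(2)                      # (B, C, H*W)
        # Weighted pooling: one C-dim descriptor per attention region.
        return torch.einsum('bkn,bcn->bkc', maps, feat)  # (B, K, C)


class STAttentionReID(nn.Module):
    """Slow path: global CNN feature; fast path: attention region features
    aggregated over time, then fused with the global feature."""

    def __init__(self, channels: int = 64, num_regions: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(              # toy stand-in for the CNN backbone
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 4)))
        self.attention = MultiSpatialAttention(channels, num_regions)
        self.fuse = nn.Linear(channels * (num_regions + 1), channels)

    def forward(self, clip):                        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feat = self.backbone(clip.flatten(0, 1))    # (B*T, C, 8, 4)
        regions = self.attention(feat)              # (B*T, K, C)
        # Mean over frames stands in for the paper's temporal aggregation model.
        regions = regions.view(b, t, *regions.shape[1:]).mean(1)      # (B, K, C)
        global_feat = feat.mean(dim=(2, 3)).view(b, t, -1).mean(1)    # (B, C)
        # Concatenation stands in for the affinity-matrix fusion.
        fused = torch.cat([global_feat, regions.flatten(1)], dim=1)
        return self.fuse(fused)                     # video-level embedding


# End-to-end training with a triplet loss, as the abstract states.
model = STAttentionReID()
anchor, positive, negative = (torch.randn(2, 8, 3, 64, 32) for _ in range(3))
loss = F.triplet_margin_loss(model(anchor), model(positive), model(negative))
loss.backward()

Mean pooling and concatenation keep the sketch short; a faithful implementation would replace them with the paper's temporal aggregation model and affinity-matrix fusion, and use a SlowFast backbone in place of the toy convolutional stem.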

