
Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 184-191. doi: 10.19678/j.issn.1000-3428.0070179

• Computer Vision and Graphics & Image Processing •

Individual Attention Target Detection Based on Multi-Feature Spatiotemporal Inference Network

YANG Jiahao, WANG Lei*()   

  1. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
  • Received: 2024-07-25 Revised: 2024-10-14 Online: 2026-05-15 Published: 2024-12-23
  • Contact: WANG Lei
  • About the authors:

    YANG Jiahao, male, master's student; his main research interests are deep learning and artificial intelligence.

    WANG Lei (corresponding author), associate professor, Ph.D.

  • Funding:
    High-Tech Innovation Special Zone Project (20-163-14-LZ-001-004-01)


Abstract:

Existing individual attention target detection methods rely mainly on facial information and therefore struggle in scenarios where fine-grained facial cues are missing because of partial occlusion, facial blurring, or privacy protection; ignoring temporal information also degrades their performance in video tasks. This paper proposes a spatiotemporal inference network based on multi-feature fusion for detecting individual attention targets. Convolutional neural networks extract key features from an individual's head appearance and facial information, the individual's posture, and the surrounding scene. Through the attention mechanism of a spatial reasoning encoder and a custom model training strategy, the network learns the relative importance of the different features, reduces overreliance on any single feature, and achieves a weighted fusion of the spatial features. A Convolutional Long Short-Term Memory (Conv-LSTM) network then integrates spatiotemporal information across the video frame sequence for attention target detection in video tasks. Experiments show that the proposed method achieves AUC values of 0.936 on the GazeFollow dataset and 0.902 on the VideoAttentionTarget dataset, improvements of 1.7 and 3.2 percentage points over the best existing methods. The method thus offers better accuracy and robustness in individual attention target detection and is suitable for more complex real-world scenarios.
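As a rough illustration of the attention-weighted fusion step described above, the following NumPy sketch combines three per-branch feature maps (head appearance, posture, scene) with softmax-normalized weights. This is not the authors' implementation: the branch set, tensor shapes, and the use of externally supplied relevance scores are all assumptions made for brevity; in the paper the weights would come from the spatial reasoning encoder's attention mechanism.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_features(branches, scores):
    """Attention-weighted fusion of per-branch feature maps.

    branches: list of feature maps, each shaped (C, H, W),
              e.g. head-appearance, posture, and scene branches.
    scores:   one relevance score per branch (assumed here to come
              from an attention module; supplied directly for brevity).
    Returns the fused (C, H, W) map and the normalized weights.
    """
    weights = softmax(np.asarray(scores, dtype=float))
    fused = sum(w * f for w, f in zip(weights, branches))
    return fused, weights

# Three toy branches with feature maps of shape (8, 16, 16)
rng = np.random.default_rng(0)
branches = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
fused, weights = fuse_features(branches, scores=[2.0, 0.5, 1.0])
print(fused.shape, weights.round(3))
```

Because the weights are softmax-normalized rather than hard-selected, no single branch can dominate the fused representation, which mirrors the stated goal of reducing overreliance on any one feature.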

Key words: attention target detection, spatiotemporal inference, attention mechanism, Convolutional Long Short-Term Memory (Conv-LSTM) network, multi-feature fusion