
Computer Engineering ›› 2026, Vol. 52 ›› Issue (5): 184-191. doi: 10.19678/j.issn.1000-3428.0070179

• Computer Vision and Graphics & Image Processing •

Individual Attention Target Detection Based on Multi-Feature Spatiotemporal Inference Network

YANG Jiahao, WANG Lei*()   

  1. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
  • Received: 2024-07-25 Revised: 2024-10-14 Online: 2026-05-15 Published: 2024-12-23
  • Contact: WANG Lei
  • About the authors:

    YANG Jiahao, male, master's student; his main research interests are deep learning and artificial intelligence.

    WANG Lei (corresponding author), associate professor, Ph.D.

  • Funding:
    High-Tech Innovation Special Zone Project (20-163-14-LZ-001-004-01)


Abstract:

Existing individual attention target detection methods rely mainly on facial information and therefore struggle in scenarios where fine-grained facial cues are missing because of partial occlusion, facial blurring, or privacy protection; ignoring temporal information also degrades their performance in video tasks. This paper proposes a spatiotemporal inference network based on multi-feature fusion for detecting individual attention targets. Convolutional neural networks extract key features from an individual's head appearance and facial information, the individual's posture, and the surrounding scene. Through the attention mechanism of a spatial reasoning encoder and a custom model training strategy, the network learns the relative importance of the different features, reduces overreliance on any single feature, and achieves a weighted fusion of the spatial features. A Convolutional Long Short-Term Memory (Conv-LSTM) network then integrates spatiotemporal information across the video frame sequence for attention target detection in video tasks. Experiments show that the proposed method achieves AUC values of 0.936 on the GazeFollow dataset and 0.902 on the VideoAttentionTarget dataset, improvements of 1.7 and 3.2 percentage points over the best existing methods. The method thus offers better accuracy and robustness in individual attention target detection and is suitable for more complex real-world scenarios.
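As a rough illustration of the attention-weighted fusion step described above, the following NumPy sketch combines three per-branch feature maps (head appearance, posture, scene) with softmax-normalized weights. This is not the authors' implementation: the branch set, tensor shapes, and the use of externally supplied relevance scores are all assumptions made for brevity; in the paper the weights would come from the spatial reasoning encoder's attention mechanism.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_features(branches, scores):
    """Attention-weighted fusion of per-branch feature maps.

    branches: list of feature maps, each shaped (C, H, W),
              e.g. head-appearance, posture, and scene branches.
    scores:   one relevance score per branch (assumed here to come
              from an attention module; supplied directly for brevity).
    Returns the fused (C, H, W) map and the normalized weights.
    """
    weights = softmax(np.asarray(scores, dtype=float))
    fused = sum(w * f for w, f in zip(weights, branches))
    return fused, weights

# Three toy branches with feature maps of shape (8, 16, 16)
rng = np.random.default_rng(0)
branches = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
fused, weights = fuse_features(branches, scores=[2.0, 0.5, 1.0])
print(fused.shape, weights.round(3))
```

Because the weights are softmax-normalized rather than hard-selected, no single branch can dominate the fused representation, which mirrors the stated goal of reducing overreliance on any one feature.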

Key words: attention target detection, spatiotemporal inference, attention mechanism, Convolutional Long Short-Term Memory (Conv-LSTM) network, multi-feature fusion