
Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 239-246. doi: 10.19678/j.issn.1000-3428.0066927

• Graphics and Image Processing •

Facial Landmark Detection Based on Hierarchical Self-Attention Network

Haochen XU*, Manhua LIU

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2023-02-13  Online: 2024-02-15  Published: 2023-04-28
  • Corresponding author: Haochen XU
  • Funding:
    National Natural Science Foundation of China General Program (62171283); Natural Science Foundation of Shanghai (20ZR1426300); Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102)


Abstract:

Facial landmark detection is a key step in facial image processing. A common approach is coordinate regression based on deep neural networks, which offers fast processing; however, the high-level network features used for regression lose spatial structure information and lack fine-grained representation ability, which lowers detection accuracy. To address this problem, a facial landmark detection algorithm based on a hierarchical self-attention network is proposed. To extract image semantic features with stronger fine-grained representation ability, a hierarchical feature fusion module based on the self-attention mechanism is constructed to fuse high-level features rich in semantic information with low-level features rich in spatial information across levels. On this basis, a multi-task training scheme that jointly learns facial landmark detection and facial pose angle estimation is designed, so that the network's estimate of the overall orientation of the face is optimized and landmark detection accuracy is improved. Experimental results on the mainstream facial landmark datasets 300W and WFLW show that, compared with methods such as SAAT and AnchorFace, the proposed method effectively improves detection accuracy, achieving normalized mean errors of 3.23% and 4.55%, respectively, 0.37 and 0.59 percentage points lower than the baseline model; the failure rate on the WFLW dataset is 3.56%, 2.86 percentage points lower than the baseline model. These results indicate that the proposed method extracts more robust and fine-grained feature representations.
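The article does not provide source code. As a rough, non-authoritative sketch of the cross-level feature fusion idea summarized above, the following PyTorch snippet fuses a low-level, high-resolution feature map with a high-level, semantically rich one through attention; the module name, channel sizes, and the single nn.MultiheadAttention layer are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-level feature fusion via attention.
# Shapes, channel sizes, and the module design are illustrative assumptions.
import torch
import torch.nn as nn


class CrossLevelAttentionFusion(nn.Module):
    def __init__(self, low_channels=64, high_channels=256, embed_dim=128, num_heads=4):
        super().__init__()
        # queries come from the low-level (spatially detailed) features,
        # keys/values from the high-level (semantically strong) features
        self.q_proj = nn.Conv2d(low_channels, embed_dim, kernel_size=1)
        self.kv_proj = nn.Conv2d(high_channels, embed_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out_proj = nn.Conv2d(embed_dim + low_channels, embed_dim, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # low_feat:  (B, C_low, H, W)   -- high resolution, weak semantics
        # high_feat: (B, C_high, h, w)  -- low resolution, strong semantics
        b, _, H, W = low_feat.shape
        q = self.q_proj(low_feat).flatten(2).transpose(1, 2)     # (B, H*W, D)
        kv = self.kv_proj(high_feat).flatten(2).transpose(1, 2)  # (B, h*w, D)
        fused, _ = self.attn(q, kv, kv)                          # each position attends to global semantics
        fused = fused.transpose(1, 2).reshape(b, -1, H, W)       # back to a feature map
        return self.out_proj(torch.cat([fused, low_feat], dim=1))


if __name__ == "__main__":
    low = torch.randn(2, 64, 32, 32)
    high = torch.randn(2, 256, 8, 8)
    print(CrossLevelAttentionFusion()(low, high).shape)  # torch.Size([2, 128, 32, 32])
```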

Key words: facial landmark detection, Convolutional Neural Network (CNN), self-attention mechanism, feature fusion, multi-task learning, deep learning
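As a further hedged illustration of the multi-task training objective and the evaluation metric mentioned in the abstract, the sketch below combines landmark coordinate regression with an auxiliary pose-angle term and computes an inter-ocular-normalized mean error. The specific loss forms, the weighting factor pose_weight, and the helper names are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a multi-task objective (landmark regression + pose angles)
# and of the normalized mean error used on 300W/WFLW-style benchmarks.
import torch
import torch.nn.functional as F


def multi_task_loss(pred_landmarks, gt_landmarks, pred_pose, gt_pose, pose_weight=0.5):
    """pred_landmarks/gt_landmarks: (B, N, 2) normalized coordinates;
    pred_pose/gt_pose: (B, 3) Euler angles (yaw, pitch, roll)."""
    landmark_loss = F.l1_loss(pred_landmarks, gt_landmarks)  # coordinate regression term
    pose_loss = F.mse_loss(pred_pose, gt_pose)               # auxiliary pose-angle term
    return landmark_loss + pose_weight * pose_loss


def normalized_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """NME in percent with inter-ocular normalization; pred/gt: (B, N, 2)."""
    per_point = torch.norm(pred - gt, dim=-1).mean(dim=-1)                          # (B,)
    inter_ocular = torch.norm(gt[:, left_eye_idx] - gt[:, right_eye_idx], dim=-1)   # (B,)
    return (per_point / inter_ocular).mean() * 100
```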