
Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 239-246. doi: 10.19678/j.issn.1000-3428.0066927

• Graphics and Image Processing •

Facial Landmark Detection Based on Hierarchical Self-Attention Network

Haochen XU*, Manhua LIU

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2023-02-13  Online: 2024-02-15  Published: 2023-04-28
  • Corresponding author: Haochen XU
  • Funding:
    National Natural Science Foundation of China General Program (62171283); Natural Science Foundation of Shanghai (20ZR1426300); Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102)


Abstract:

Facial landmark detection is a key step in facial image processing. A common approach is coordinate regression based on deep neural networks, which offers fast processing; however, the high-level network features used for regression lose spatial structure information and lack fine-grained representation ability, which lowers detection accuracy. To address this problem, a facial landmark detection algorithm based on a hierarchical self-attention network is proposed. To extract image semantic features with stronger fine-grained representation ability, a hierarchical feature fusion module based on the self-attention mechanism is constructed to fuse high-level features rich in semantic information with low-level features rich in spatial information across levels. On this basis, a multi-task training scheme that jointly learns facial landmark detection and facial pose angle estimation is designed, so that the network's estimate of the overall orientation of the face is optimized and landmark detection accuracy is improved. Experimental results on the mainstream facial landmark datasets 300W and WFLW show that, compared with methods such as SAAT and AnchorFace, the proposed method effectively improves detection accuracy, achieving normalized mean errors of 3.23% and 4.55%, respectively, 0.37 and 0.59 percentage points lower than the baseline model; the failure rate on the WFLW dataset is 3.56%, 2.86 percentage points lower than the baseline model. These results indicate that the proposed method extracts more robust and fine-grained feature representations.
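The article does not provide source code. As a rough, non-authoritative sketch of the cross-level feature fusion idea summarized above, the following PyTorch snippet fuses a low-level, high-resolution feature map with a high-level, semantically rich one through attention; the module name, channel sizes, and the single nn.MultiheadAttention layer are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-level feature fusion via attention.
# Shapes, channel sizes, and the module design are illustrative assumptions.
import torch
import torch.nn as nn


class CrossLevelAttentionFusion(nn.Module):
    def __init__(self, low_channels=64, high_channels=256, embed_dim=128, num_heads=4):
        super().__init__()
        # queries come from the low-level (spatially detailed) features,
        # keys/values from the high-level (semantically strong) features
        self.q_proj = nn.Conv2d(low_channels, embed_dim, kernel_size=1)
        self.kv_proj = nn.Conv2d(high_channels, embed_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out_proj = nn.Conv2d(embed_dim + low_channels, embed_dim, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # low_feat:  (B, C_low, H, W)   -- high resolution, weak semantics
        # high_feat: (B, C_high, h, w)  -- low resolution, strong semantics
        b, _, H, W = low_feat.shape
        q = self.q_proj(low_feat).flatten(2).transpose(1, 2)     # (B, H*W, D)
        kv = self.kv_proj(high_feat).flatten(2).transpose(1, 2)  # (B, h*w, D)
        fused, _ = self.attn(q, kv, kv)                          # each position attends to global semantics
        fused = fused.transpose(1, 2).reshape(b, -1, H, W)       # back to a feature map
        return self.out_proj(torch.cat([fused, low_feat], dim=1))


if __name__ == "__main__":
    low = torch.randn(2, 64, 32, 32)
    high = torch.randn(2, 256, 8, 8)
    print(CrossLevelAttentionFusion()(low, high).shape)  # torch.Size([2, 128, 32, 32])
```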

Key words: facial landmark detection, Convolutional Neural Network (CNN), self-attention mechanism, feature fusion, multi-task learning, deep learning
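As a further hedged illustration of the multi-task training objective and the evaluation metric mentioned in the abstract, the sketch below combines landmark coordinate regression with an auxiliary pose-angle term and computes an inter-ocular-normalized mean error. The specific loss forms, the weighting factor pose_weight, and the helper names are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a multi-task objective (landmark regression + pose angles)
# and of the normalized mean error used on 300W/WFLW-style benchmarks.
import torch
import torch.nn.functional as F


def multi_task_loss(pred_landmarks, gt_landmarks, pred_pose, gt_pose, pose_weight=0.5):
    """pred_landmarks/gt_landmarks: (B, N, 2) normalized coordinates;
    pred_pose/gt_pose: (B, 3) Euler angles (yaw, pitch, roll)."""
    landmark_loss = F.l1_loss(pred_landmarks, gt_landmarks)  # coordinate regression term
    pose_loss = F.mse_loss(pred_pose, gt_pose)               # auxiliary pose-angle term
    return landmark_loss + pose_weight * pose_loss


def normalized_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """NME in percent with inter-ocular normalization; pred/gt: (B, N, 2)."""
    per_point = torch.norm(pred - gt, dim=-1).mean(dim=-1)                          # (B,)
    inter_ocular = torch.norm(gt[:, left_eye_idx] - gt[:, right_eye_idx], dim=-1)   # (B,)
    return (per_point / inter_ocular).mean() * 100
```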