
Computer Engineering ›› 2025, Vol. 51 ›› Issue (2): 312-321. doi: 10.19678/j.issn.1000-3428.0068905

• Graphics and Image Processing •


Estimation of Local Illumination Consistency Based on Improved Vision Transformer

WANG Yang1,*, SONG Shijia1, WANG Heqin1, YUAN Zhenyu1, ZHAO Lijun2, WU Qilin1

  1. School of Computer and Information, Anhui Normal University, Wuhu 241000, Anhui, China
    2. Yangtze River Delta Region Hart Robotics Industry Technology Research Institute, Wuhu 241000, Anhui, China
  • Received: 2023-11-27  Online: 2025-02-15  Published: 2024-05-21
  • Contact: WANG Yang
  • Supported by: National Natural Science Foundation of China (61871412); Key Projects of the Natural Science Foundation of Anhui Province (KJ2019A0938, KJ2021A1314, KJ2019A0979); Key Natural Science Research Projects of Anhui Universities (2022AH052899, KJ2019A0979, KJ2019A0511, 2023AH052757); Open Project of the Anhui Province Key Laboratory of Machine Vision Inspection (KLMVI-2023-HIT-11); Academic Funding Project for Top Talents in Disciplines (Majors) of Anhui Universities (gxbjZD2022147)


Abstract:

Illumination consistency is a key factor in achieving the organic fusion of virtual and real elements in Augmented Reality (AR) systems. Owing to the limited capture viewpoint and the complexity of scene illumination, developers often overlook local illumination consistency when estimating panoramic lighting information, which degrades the final rendering quality. To address this issue, this study proposes ViTLight, a local illumination consistency estimation framework based on an improved Vision Transformer (ViT) structure. First, a ViT encoder extracts feature vectors and regresses Spherical Harmonics (SH) coefficients, from which the illumination information is recovered. Second, the ViT encoder structure is improved by introducing a multi-head self-attention interaction mechanism, in which a convolution operation guides the interaction between attention heads. On this basis, a local perception module is added that scans each image patch and performs a weighted summation over local pixels to capture region-specific features, which helps balance global contextual features and local illumination information and improves the accuracy of illumination estimation. ViTLight is compared with mainstream feature extraction networks and four classical illumination estimation frameworks on public datasets. The experimental results and analysis show that ViTLight outperforms existing frameworks in image rendering accuracy, achieving a Root Mean Square Error (RMSE) of 0.129 6 and a Structural Dissimilarity (DSSIM) of 0.042 6, which verifies the effectiveness and correctness of the framework.
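
The following minimal PyTorch-style sketch illustrates the pipeline described in the abstract: a ViT-style encoder in which a convolution lets attention heads interact, a local perception block that takes a weighted sum of neighbouring patch tokens, and a head that regresses SH lighting coefficients. All class names, layer choices, and dimensions (e.g. second-order SH, i.e. 9 coefficients per RGB channel) are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch only: ViT-style encoder -> SH coefficient regression.
    import torch
    import torch.nn as nn

    class InteractiveAttention(nn.Module):
        """Multi-head self-attention in which a 1x1 convolution mixes the attention
        maps of different heads (an assumed reading of 'a convolution operation
        guides the interaction between attention heads')."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.heads = heads
            self.scale = (dim // heads) ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.mix = nn.Conv2d(heads, heads, kernel_size=1)  # head-to-head interaction
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                                  # x: (B, N, C) patch tokens
            b, n, c = x.shape
            qkv = self.qkv(x).reshape(b, n, 3, self.heads, c // self.heads)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each: (B, heads, N, head_dim)
            attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, heads, N, N)
            attn = self.mix(attn).softmax(dim=-1)              # convolution across heads
            out = (attn @ v).transpose(1, 2).reshape(b, n, c)
            return self.proj(out)

    class LocalPerception(nn.Module):
        """Assumed local perception block: a depthwise 3x3 convolution over the
        patch grid, i.e. a weighted sum of neighbouring patch tokens, added residually."""
        def __init__(self, dim, grid):
            super().__init__()
            self.grid = grid
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

        def forward(self, x):                                  # x: (B, N, C), N = grid * grid
            b, n, c = x.shape
            y = x.transpose(1, 2).reshape(b, c, self.grid, self.grid)
            y = self.dwconv(y).flatten(2).transpose(1, 2)
            return x + y                                       # global + local features

    class ViTLightSketch(nn.Module):
        def __init__(self, img_size=224, patch=16, dim=192, heads=4):
            super().__init__()
            grid = img_size // patch
            self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
            self.norm1 = nn.LayerNorm(dim)
            self.attn = InteractiveAttention(dim, heads)
            self.local = LocalPerception(dim, grid)
            self.norm2 = nn.LayerNorm(dim)
            self.sh_head = nn.Linear(dim, 27)                  # 9 SH coefficients x RGB

        def forward(self, x):                                  # x: (B, 3, H, W) input image
            t = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim) patch tokens
            t = t + self.attn(self.norm1(t))                   # attention with head interaction
            t = self.local(t)                                  # local perception over patches
            return self.sh_head(self.norm2(t).mean(dim=1))     # (B, 27) SH lighting estimate

    model = ViTLightSketch()
    sh = model(torch.randn(2, 3, 224, 224))                    # sh.shape == (2, 27)

In this sketch the regressed SH vector would then be used to shade virtual objects; a real implementation would stack several such encoder blocks and train against ground-truth SH coefficients.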

Key words: Augmented Reality (AR), illumination estimation, Spherical Harmonics (SH) coefficient, Vision Transformer (ViT), multi-head self-attention
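
For reference, a hedged sketch of the two metrics reported in the abstract, computed between a re-rendered image and its ground truth. RMSE is the root mean square pixel error; DSSIM is computed here as (1 - SSIM) / 2, a common convention, since the abstract does not state the exact definition used. The example assumes float images of shape (H, W, 3) in [0, 1] and scikit-image >= 0.19 for the channel_axis argument.

    # Hedged sketch of the reported evaluation metrics (RMSE and DSSIM).
    import numpy as np
    from skimage.metrics import structural_similarity   # scikit-image >= 0.19

    def rmse(pred: np.ndarray, gt: np.ndarray) -> float:
        """Root Mean Square Error over all pixels and channels."""
        diff = pred.astype(np.float64) - gt.astype(np.float64)
        return float(np.sqrt(np.mean(diff ** 2)))

    def dssim(pred: np.ndarray, gt: np.ndarray) -> float:
        """Structural dissimilarity, assuming DSSIM = (1 - SSIM) / 2."""
        ssim = structural_similarity(pred, gt, data_range=1.0, channel_axis=-1)
        return (1.0 - ssim) / 2.0

    rendered, reference = np.random.rand(64, 64, 3), np.random.rand(64, 64, 3)
    print(f"RMSE={rmse(rendered, reference):.4f}  DSSIM={dssim(rendered, reference):.4f}")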