
Computer Engineering (计算机工程) ›› 2026, Vol. 52 ›› Issue (6): 288-295. doi: 10.19678/j.issn.1000-3428.0070600

• Multimodal and Information Fusion •

Multimodal Fusion-Based Method for 360° Image Quality and Aesthetic Assessment

CUI Shuangxin, LU Bo, ZHANG Mingyue, ZHAO Yifan, WANG Ziming, LIU Xinyu, CHEN Chenglizhao*

  1. Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, Shandong, China
  • Received: 2024-11-11 Revised: 2025-01-20 Published: 2026-06-15 Online: 2025-03-13
  • Corresponding author: CHEN Chenglizhao
  • About the authors:

    CUI Shuangxin, female, master's student; main research interest: computer vision

    LU Bo, master's student

    ZHANG Mingyue, master's student

    ZHAO Yifan, master's student

    WANG Ziming, master's student

    LIU Xinyu, master's student

    CHEN Chenglizhao (corresponding author), professor, Ph.D.

  • Funding:
    Excellent Young Scientists Fund of the Shandong Provincial Natural Science Foundation (ZR2024YQ071); National Natural Science Foundation of China (62172246); Youth Innovation Science and Technology Support Program of Shandong Provincial Universities (2021KJ062)

Abstract:

Although existing 360° image quality assessment methods have made considerable progress in designing heuristic evaluation models, their predictions still differ substantially from subjectively perceived quality because they do not fully account for how humans view 360° images. To address this limitation, this paper proposes a panoramic image quality assessment method that combines image quality and aesthetic features, aiming to evaluate images comprehensively from a perspective closer to human perception and to accurately reflect the overall quality of panoramic images. The method consists of two main stages. First, a multimodal large language model parses the image dataset and generates textual descriptions covering both image quality and aesthetic features, yielding an image-text pair dataset. This step joins two otherwise independent tasks, image quality assessment and aesthetic assessment, which helps the model understand each image more comprehensively. Second, a dual-stream multimodal quality perception model is designed to fuse textual and image features and exploit the multimodal information in the image. In addition, Triplet Loss is added to the conventional L2 loss function to better reflect subjective quality differences between samples. On the CVIQD and OIQA benchmark datasets, the proposed method outperforms existing state-of-the-art methods in terms of Spearman's Rank Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Square Error (RMSE).
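The combination of an L2 loss with Triplet Loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the hinge form on squared Euclidean distances, the `margin` and `weight` parameters, and the rule that the "positive" sample is the one whose subjective score is closer to the anchor's than the "negative" sample's.

```python
from typing import Sequence


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and subjective quality scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge-style triplet loss on squared Euclidean feature distances.

    Assumption: `positive` comes from an image whose subjective score is
    closer to the anchor's than that of `negative`.
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_pos - d_neg + margin, 0.0)


def total_loss(pred: Sequence[float], target: Sequence[float],
               anchor: Sequence[float], positive: Sequence[float],
               negative: Sequence[float],
               weight: float = 0.5, margin: float = 1.0) -> float:
    """L2 regression term plus a weighted triplet term (weight is hypothetical)."""
    return l2_loss(pred, target) + weight * triplet_loss(
        anchor, positive, negative, margin)
```

The triplet term is zero once the anchor is closer to the positive than to the negative by at least `margin`, so in this sketch it only penalizes feature embeddings whose ordering disagrees with the subjective scores.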

Key words: quality assessment, panoramic image, multimodal fusion, large language model, quality perception
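
For reference, the three metrics named in the abstract can be computed as in the following standard-library sketch. The function names are illustrative and not taken from the paper. Many image quality assessment studies fit a nonlinear mapping between predictions and subjective scores before computing PLCC and RMSE; the abstract does not say whether that is done here, and this sketch omits that step.

```python
import math
from typing import List, Sequence


def plcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson Linear Correlation Coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's Rank Correlation Coefficient: Pearson correlation of ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def rmse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Root Mean Square Error between predictions and subjective scores."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```

SRCC measures whether predictions rank images in the same order as the subjective scores, PLCC measures linear agreement, and RMSE measures absolute error, so higher SRCC and PLCC and lower RMSE indicate better agreement with human judgments.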