Multimodal Fusion-Based Method for 360° Image Quality and Aesthetic Assessment

Computer Engineering (计算机工程), 2026, Vol. 52, Issue (6): 288-295. doi: 10.19678/j.issn.1000-3428.0070600

CUI Shuangxin (崔爽锌), LU Bo (卢搏), ZHANG Mingyue (张明月), ZHAO Yifan (赵一汎), WANG Ziming (王子铭), LIU Xinyu (刘新宇), CHEN Chenglizhao (陈程立诏)*

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, Shandong, China

Received: 2024-11-11  Revised: 2025-01-20  Published: 2026-06-15  Released: 2025-03-13
Corresponding author: CHEN Chenglizhao
About the authors:
CUI Shuangxin, female, master's student; main research interest: computer vision
LU Bo, master's student
ZHANG Mingyue, master's student
ZHAO Yifan, master's student
WANG Ziming, master's student
LIU Xinyu, master's student
CHEN Chenglizhao (corresponding author), professor, Ph.D.
Funding:
Excellent Young Scientists Fund of the Shandong Provincial Natural Science Foundation (ZR2024YQ071); National Natural Science Foundation of China (62172246); Youth Innovation Science and Technology Support Program of Shandong Province Universities (2021KJ062)

Abstract:

Although considerable progress has been made in designing heuristic models for 360° image quality assessment, the scores produced by existing methods still differ markedly from human subjective perception, because these methods do not adequately account for how humans view 360° images. To address this limitation, this paper proposes a panoramic image quality assessment method that combines image quality and aesthetic features, aiming to evaluate images comprehensively from a perspective closer to human perception and to reflect the overall quality of panoramic images accurately. The method has two main stages. First, a multimodal large language model parses the image dataset and generates textual descriptions covering both image quality and aesthetic features, yielding an image-text pair dataset. This step joins two otherwise independent tasks, image quality assessment and aesthetic assessment, and helps the model understand each image more fully. Second, a dual-stream multimodal quality perception model is designed to fuse textual and visual features and to exploit the multimodal information in the image. In addition, Triplet Loss is added to the traditional L2-norm loss function to better reflect the subjective quality differences between samples. On the CVIQD and OIQA benchmark datasets, the proposed method outperforms existing state-of-the-art methods in terms of Spearman's Rank Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), and Root Mean Square Error (RMSE).
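
The training objective mentioned in the abstract, an L2 loss extended with a Triplet Loss term, can be illustrated with a minimal sketch. The abstract does not say how triplets are formed or how the two terms are weighted, so everything beyond the two loss names is an assumption: the function names, the `margin` and `weight` parameters, and the idea that the positive sample is the one whose subjective score is closer to the anchor's are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and subjective quality scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard triplet margin loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


def total_loss(pred, target, anchor, positive, negative,
               margin: float = 1.0, weight: float = 0.5) -> float:
    """L2 regression term plus a weighted triplet term.
    `weight` and `margin` are hypothetical values, not taken from the paper."""
    return l2_loss(pred, target) + weight * triplet_loss(
        anchor, positive, negative, margin)


# Toy inputs: two predicted scores with their subjective scores, and
# 2-D features for an anchor, a similar-quality (positive) sample,
# and a dissimilar-quality (negative) sample.
pred, target = [3.0, 4.0], [3.5, 4.0]
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
loss = total_loss(pred, target, anchor, positive, negative)
```

In this toy case the negative sample already lies well beyond the margin, so the triplet term is zero and only the L2 term contributes. Swapping the positive and negative features makes the triplet term active.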

Key words: quality assessment, panoramic image, multimodal fusion, large language model, quality perception
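
The three evaluation metrics named in the abstract (SRCC, PLCC, RMSE) can be computed as in the following dependency-free sketch. The function names and toy scores are illustrative assumptions. In image quality assessment work, PLCC and RMSE are commonly computed after fitting a nonlinear mapping from predicted scores to subjective scores; that step is omitted here, and the abstract does not state whether the authors apply it.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(pred: Sequence[float], mos: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(mos))


def plcc(pred: Sequence[float], mos: Sequence[float]) -> float:
    """Pearson linear correlation between predictions and subjective scores."""
    return pearson(pred, mos)


def rmse(pred: Sequence[float], mos: Sequence[float]) -> float:
    """Root mean square error between predictions and subjective scores."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(pred))


# Toy scores (not from the paper): predictions are offset from the
# subjective scores by a constant but preserve their order.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.5, 2.5, 3.5, 4.5]
scores = (srcc(predicted, subjective),
          plcc(predicted, subjective),
          rmse(predicted, subjective))
```

Higher SRCC and PLCC indicate better agreement with subjective scores, while lower RMSE is better. The toy example shows why all three are reported together: a constant offset leaves both correlations at their maximum but still produces a nonzero RMSE.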

CUI Shuangxin, LU Bo, ZHANG Mingyue, ZHAO Yifan, WANG Ziming, LIU Xinyu, CHEN Chenglizhao. Multimodal Fusion-Based Method for 360° Image Quality and Aesthetic Assessment[J]. Computer Engineering, 2026, 52(6): 288-295.

https://www.ecice06.com/CN/Y2026/V52/I6/288

Figures/Tables 6

Fig.1 Overall procedure of the proposed algorithm

Fig.2 Visualization of image quality scores produced by different methods

Fig.3 Results of the ablation experiment on text information
