
计算机工程 (Computer Engineering)


Controllable Underwater Image Generation Method Based on Improved ControlNet

  • Published: 2025-09-09

Abstract: Underwater image generation is an important means of filling data gaps in marine exploration, and the authenticity and diversity of generated images directly affect the reliability of subsequent analysis. Existing models typically have large parameter counts and time-consuming training and inference; the generated underwater images lack clarity, with distortions in the structures and edges of image subjects; and the inference process does not adequately account for the unique optical properties of underwater environments, so the authenticity of the generated images remains to be improved. To address these issues, this paper proposes UW-ControlNet (Underwater ControlNet), an underwater image generation framework built on the ControlNet architecture. It fine-tunes the parameters of a pretrained Stable Diffusion model and combines structural constraints from conditional images with semantic constraints from text prompts to achieve cross-modal, controllable generation of underwater images. A lightweight feature extraction network is introduced to improve the feature extraction of conditional images, increasing the model's convergence and inference speed. A correlation-matrix-based channel attention module is designed to decouple and recouple the global channel features corresponding to the background and the local channel features corresponding to the subject, improving text-image multimodal alignment during generation and enhancing the credibility of the results. A structure-semantics constraint enhancement module is constructed to prevent the loss of constraint information caused by downsampling, ensuring structural consistency between generated images and conditional images. Experimental results show that underwater images generated by UW-ControlNet outperform those of existing methods in both quantitative metrics and qualitative comparisons, demonstrating strong application value.
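The correlation-matrix channel attention described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of Pearson correlation between flattened channels, and the mean-correlation softmax weighting are all assumptions made for illustration — the idea shown is simply that channels whose responses correlate strongly across the whole map (background-like, global) and channels with distinctive local responses (subject-like) receive different weights.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def correlation_channel_attention(feat, eps=1e-8):
    """Re-weight the channels of a (C, H, W) feature map via a
    channel-correlation matrix (hypothetical simplification of the
    paper's module).

    Each channel is flattened and normalized, a (C, C) Pearson
    correlation matrix is formed, and each channel's attention weight
    is the softmax of its mean correlation with all channels.
    """
    C = feat.shape[0]
    flat = feat.reshape(C, -1)                           # (C, H*W)
    flat = flat - flat.mean(axis=1, keepdims=True)       # zero-mean per channel
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    corr = unit @ unit.T                                 # (C, C) correlations
    weights = softmax(corr.mean(axis=1))                 # one scalar per channel
    return feat * weights[:, None, None]
```

In the actual module the decoupling/recoupling of background (global) and subject (local) channel features would be learned; here the correlation statistics merely stand in for that behavior to show the data flow.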