
计算机工程 (Computer Engineering)


Controllable Underwater Image Generation Method Based on Improved ControlNet

  • Published: 2025-09-09

Abstract: Underwater image generation is an important means of filling data gaps in marine exploration, and the authenticity and diversity of generated images directly affect the reliability of subsequent analysis. Existing models typically have large parameter counts and time-consuming training and inference; the generated underwater images lack clarity, with distortions in the structures and edges of image subjects; and the inference process does not adequately account for the unique optical properties of underwater environments, so the authenticity of the generated images remains to be improved. To address these issues, this paper proposes UW-ControlNet (Underwater ControlNet), an underwater image generation framework built on the ControlNet architecture. It fine-tunes the parameters of a pretrained Stable Diffusion model and combines structural constraints from conditional images with semantic constraints from text prompts to achieve cross-modal, controllable generation of underwater images. A lightweight feature extraction network is introduced to improve the feature extraction of conditional images, increasing the model's convergence and inference speed. A correlation-matrix-based channel attention module is designed to decouple and recouple the global channel features corresponding to the background and the local channel features corresponding to the subject, improving text-image multimodal alignment during generation and enhancing the credibility of the results. A structure-semantics constraint enhancement module is constructed to prevent the loss of constraint information caused by downsampling, ensuring structural consistency between generated images and conditional images. Experimental results show that underwater images generated by UW-ControlNet outperform those of existing methods in both quantitative metrics and qualitative comparisons, demonstrating strong application value.
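The correlation-matrix channel attention described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of Pearson correlation between flattened channels, and the mean-correlation softmax weighting are all assumptions made for illustration — the idea shown is simply that channels whose responses correlate strongly across the whole map (background-like, global) and channels with distinctive local responses (subject-like) receive different weights.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def correlation_channel_attention(feat, eps=1e-8):
    """Re-weight the channels of a (C, H, W) feature map via a
    channel-correlation matrix (hypothetical simplification of the
    paper's module).

    Each channel is flattened and normalized, a (C, C) Pearson
    correlation matrix is formed, and each channel's attention weight
    is the softmax of its mean correlation with all channels.
    """
    C = feat.shape[0]
    flat = feat.reshape(C, -1)                           # (C, H*W)
    flat = flat - flat.mean(axis=1, keepdims=True)       # zero-mean per channel
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    corr = unit @ unit.T                                 # (C, C) correlations
    weights = softmax(corr.mean(axis=1))                 # one scalar per channel
    return feat * weights[:, None, None]
```

In the actual module the decoupling/recoupling of background (global) and subject (local) channel features would be learned; here the correlation statistics merely stand in for that behavior to show the data flow.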