Lightweight DiT-Based Video Virtual Try-On via First-Frame Guidance and Temporal Priors

doi:10.19678/j.issn.1000-3428.0260235

Abstract

Video virtual try-on aims to transfer a target garment onto a person in a video while keeping body motion and garment appearance consistent; it is a core technology for e-commerce, virtual reality, and short-video creation. However, existing frameworks still struggle to balance generation quality against computational efficiency. Traditional methods based on Generative Adversarial Networks (GANs) often rely on optical flow estimation to warp the garment, which is prone to texture distortion and visual artifacts under complex motion. More recently, U-Net-based diffusion models have achieved high-fidelity generation by adding a garment reference branch. When such dual-branch architectures are migrated to larger and more expressive Diffusion Transformer (DiT) backbones, however, they introduce substantial parameter redundancy and VRAM overhead. Furthermore, existing methods typically re-inject static garment features during the denoising of every frame. This increases the computational burden and, because static features carry no inherent temporal correlation, makes it hard for the model to maintain spatiotemporal coherence under non-rigid deformation, which results in severe flickering. To address these challenges of adaptability, training efficiency, and resource consumption for DiT architectures in video virtual try-on, this study proposes a lightweight framework named OIE (Once is Enough). OIE adopts a single-branch strategy built on first-frame guidance and one-time injection, which decouples garment editing from temporal generation.
First, in the garment appearance injection stage, a pre-trained high-fidelity image virtual try-on model, FitDiT, edits the initial frame of the video, producing a result that carries fine-grained garment textures. Second, to preserve the temporal priors of the DiT model as far as possible, only the edited first frame is embedded into the backbone as the starting token of the latent feature sequence. This avoids the dense cross-branch feature interaction modules typical of dual-branch architectures and requires no structural modification to the backbone. Third, to compensate for the background layout information lost because of human motion, the method introduces a lightweight background encoder that smoothly adds background information to the backbone features through a mask guider. Finally, in the fine-tuning stage, Low-Rank Adaptation (LoRA) is applied to all self-attention, cross-attention, and feed-forward network (FFN) modules of the DiT, so that the large model can be adapted with very few trainable parameters. Quantitative results on the ViViD and VVT datasets show that, in terms of efficiency, OIE adds only 0.50% parameter overhead, with FLOPs and FPS remaining virtually unchanged, far less than dual-branch methods such as MagicTryOn (15.11% parameter increase) and ViViD (157.10% parameter increase). In terms of quality, OIE achieves strong video quality scores under both the paired and unpaired settings on the ViViD dataset, with a VFIDp of 9.3983 and a VFIDu of 17.0831, significantly ahead of existing mainstream methods. Ablation studies confirm that high-quality first-frame guidance suppresses errors in the early stage of generation, raising SSIM to 0.8466.
Through this decoupling strategy, OIE reduces the computational burden of DiT architectures in video generation and achieves a good balance among garment fidelity, temporal coherence, and computational efficiency. The results indicate that strong temporal priors within a single-branch architecture can replace high-frequency feature injection, offering a lightweight path toward high-resolution and real-time video editing.
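
As a rough illustration of two ideas in the abstract, the sketch below shows (i) prepending the tokens of an edited first frame to a latent token sequence and (ii) a LoRA-style low-rank update on a frozen linear layer, together with the resulting trainable-parameter ratio. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `LoRALinear` and `prepend_first_frame`, the layer sizes, the rank, and the `alpha / rank` scaling are hypothetical choices, and the ratio it prints is illustrative only and unrelated to the reported 0.50% figure.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration: sizes, rank, and initialization are assumptions.
    """

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pre-trained weight (random here, standing in for a backbone weight).
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Trainable low-rank factors; B starts at zero so the update is initially a no-op.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (tokens, d_in); output has shape (tokens, d_out).
        base = x @ self.W.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def frozen_params(self):
        return self.W.size

    def trainable_params(self):
        return self.A.size + self.B.size


def prepend_first_frame(first_frame_tokens, video_tokens):
    """Place the edited first-frame tokens at the start of the latent sequence.

    Both inputs have shape (num_tokens, dim); the backbone itself is untouched.
    """
    return np.concatenate([first_frame_tokens, video_tokens], axis=0)


if __name__ == "__main__":
    dim = 1024
    layer = LoRALinear(dim, dim, rank=4)

    rng = np.random.default_rng(1)
    first_frame = rng.standard_normal((16, dim))  # tokens of the edited first frame
    noisy_video = rng.standard_normal((64, dim))  # tokens of the frames to denoise

    sequence = prepend_first_frame(first_frame, noisy_video)
    out = layer(sequence)

    ratio = layer.trainable_params() / layer.frozen_params()
    print("sequence shape:", sequence.shape)
    print("output shape:", out.shape)
    print(f"trainable / frozen parameters: {ratio:.4%}")
```

Because `B` is initialized to zero, the adapted layer reproduces the frozen layer exactly before any fine-tuning, which is one reason low-rank adapters can be attached without disturbing a backbone's pre-trained behavior.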


PAN Yanjie, CHI Mingmin, PENG Bo. Lightweight DiT-Based Video Virtual Try-On via First-Frame Guidance and Temporal Priors[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260235.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260235

References

[1] Zhu X, Xu C. Research on virtual try-on based on style transfer[J]. Basic Sciences Journal of Textile Universities, 2023, 36(1): 65.
[2] Zu Y, Zhang Y. Virtual try-on method based on large-scale pre-trained text-image models[J]. Journal of Silk, 2023, 60(8): 99.
[3] Huang D, Li X, Liu J, et al. Pose-guided virtual try-on network[J]. Journal of Shanghai University, 2024, 30(3): 491.
[4] Fang Z, Zhai W, Su A, et al. ViViD: Video virtual try-on using diffusion models[J]. arXiv preprint arXiv:2405.11794, 2024.
[5] Li G, Zheng S, Zhang H, et al. MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on[J]. arXiv preprint arXiv:2505.21325, 2025.
[6] Chong Z, Zhang W, Zhang S, et al. CatV2TON: Taming diffusion transformers for vision-based virtual try-on with temporal concatenation[J]. arXiv preprint arXiv:2501.11325, 2025.
[7] Jiang J, Wang T, Yan H, et al. Clothformer: Taming video virtual try-on in all module[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10799-10808.
[8] Deng Z, He X, Peng Y, et al. MV-Diffusion: Motion-aware video diffusion model[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 7255-7263.
[9] Dong H, Liang X, Shen X, et al. Fw-gan: Flow-navigated warping gan for video virtual try-on[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 1161-1170.
[10] Zheng J, Wang J, Zhao F, et al. Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism[J]. arXiv preprint arXiv:2412.09822, 2024.
[11] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.
[12] Blattmann A, Dockhorn T, Kulal S, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets[J]. arXiv preprint arXiv:2311.15127, 2023.
[13] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[14] Nguyen H, Nguyen Q Q V, Nguyen K, et al. Swifttry: Fast and consistent video virtual try-on with diffusion models[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(6): 6200-6208.
[15] He Z, Chen P, Wang G, et al. Wildvidfit: Video virtual try-on in the wild via image-based controlled diffusion models[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 123-139.
[16] Xu Z, Chen M, Wang Z, et al. Tunnel try-on: Excavating spatial-temporal tunnels for high-quality virtual try-on in videos[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 3199-3208.
[17] Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 4195-4205.
[18] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in neural information processing systems, 2020, 33: 6840-6851.
[19] Wan T, Wang A, Ai B, et al. Wan: Open and advanced large-scale video generative models[J]. arXiv preprint arXiv:2503.20314, 2025.
[20] Kong W, Tian Q, Zhang Z, et al. HunyuanVideo: A systematic framework for large video generative models[J]. arXiv preprint arXiv:2412.03603, 2024.
[21] Jiang B, Hu X, Luo D, et al. FitDiT: Advancing the authentic garment details for high-fidelity virtual try-on[J]. arXiv preprint arXiv:2411.10499, 2024.
[22] Hu E J, Shen Y, Wallis P, et al. LoRA: Low-rank adaptation of large language models[C]//International Conference on Learning Representations. 2022.
[23] Choi S, Park S, Lee M, et al. VITON-HD: High-resolution virtual try-on via misalignment-aware normalization[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 14131-14140.
[24] Esser P, Kulal S, Blattmann A, et al. Scaling rectified flow transformers for high-resolution image synthesis[C]//Forty-first international conference on machine learning. 2024.
[25] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[26] Wu J Z, Ge Y, Wang X, et al. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2023: 7623-7633.
[27] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3d convolutional networks[C]//Proceedings of the IEEE international conference on computer vision. 2015: 4489-4497.
[28] Polyak A, Zohar A, Brown A, et al. Movie gen: A cast of media foundation models[J]. arXiv preprint arXiv:2410.13720, 2024.
[29] Yang Z, Teng J, Zheng W, et al. Cogvideox: Text-to-video diffusion models with an expert transformer[J]. arXiv preprint arXiv:2408.06072, 2024.
[30] Lipman Y, Chen R T Q, Ben-Hamu H, et al. Flow matching for generative modeling[J]. arXiv preprint arXiv:2210.02747, 2022.
[31] Zhang R, Isola P, Efros A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 586-595.
[32] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE transactions on image processing, 2004, 13(4): 600-612.
[33] Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6299-6308.
[34] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018: 6546-6555.
[35] Kim J, Gu G, Park M, et al. StableVITON: Learning semantic correspondence with latent diffusion model for virtual try-on[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2024: 8176-8185.
[36] Xu Y, Gu T, Chen W, et al. OOTDiffusion: Outfitting fusion based latent diffusion for controllable virtual try-on[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(9): 8996-9004.
[37] Choi Y, Kwak S, Lee K, et al. Improving diffusion models for authentic virtual try-on in the wild[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024: 206-235.
