
Computer Engineering, 2025, Vol. 51, Issue (10): 346-356. doi: 10.19678/j.issn.1000-3428.0069375

• Graphics and Image Processing •

Multiple Video Frame Interpolation Method Based on Transformer and Enhanced Deformable Separable Convolution

SHI Changtong, SHAN Hongtao*

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2024-02-19  Revised: 2024-04-14  Online: 2025-10-15  Published: 2024-08-13
  • Contact: SHAN Hongtao

  • Funding: National Natural Science Foundation of China (62173222)

Abstract:

Existing multiple Video Frame Interpolation (VFI) methods rely on optical flow or Convolutional Neural Networks (CNNs), and the inherent limitations of both make it difficult for them to handle scenes with large motion effectively. To address this challenge, this study proposes a multiple VFI method based on Transformer and enhanced deformable separable convolution. The method combines shifted-window and cross-scale-window attention, enlarging the receptive field of the attention mechanism, and feeds the time step into the frame synthesis network as a key control variable, enabling interpolation at arbitrary time positions. Specifically, shallow features are first extracted by embedding layers, after which an encoder-decoder architecture extracts multi-scale deep features. Finally, a multi-scale, multi-frame synthesis network built on enhanced deformable separable convolutions takes the multi-scale features, the original video frames, and the time step as inputs and synthesizes intermediate frames at any time position. Experimental results demonstrate that the proposed method achieves high interpolation performance on several commonly used multiple VFI datasets. On the Vimeo90K septuplet dataset, it reaches a Peak Signal-to-Noise Ratio (PSNR) of 27.98 dB and a Structural Similarity (SSIM) of 0.912 for multi-frame interpolation, while its single-frame interpolation performance is also on par with mainstream methods. Visualization results show that, compared with other methods, the proposed method produces clearer and more plausible intermediate frames in scenes with large motion.

Key words: multiple Video Frame Interpolation (VFI), enhanced deformable separable convolution, arbitrary time position frame interpolation, cross-scale window-based attention, large motion
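
To make the time-step-conditioned frame-synthesis step described in the abstract concrete, the following is a minimal sketch, in PyTorch-style Python, of one way a deformable-kernel synthesis layer could predict per-pixel sampling offsets and weights conditioned on the time step t. All names here (DeformableKernelSynthesis, feat_ch, k, head) are illustrative assumptions for this sketch, not the authors' implementation, and the paper's "enhanced deformable separable convolution" will differ in its details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableKernelSynthesis(nn.Module):
    # Hypothetical sketch: predicts k per-pixel sampling offsets and
    # weights from decoder features concatenated with a broadcast
    # time-step map, then warps the input frame by weighted sampling
    # at the predicted offset locations.
    def __init__(self, feat_ch: int, k: int = 5):
        super().__init__()
        self.k = k
        # +1 input channel for the time-step map t in [0, 1]
        self.head = nn.Conv2d(feat_ch + 1, 3 * k, kernel_size=3, padding=1)

    def forward(self, frame, feat, t):
        # frame: (B, C, H, W) input video frame
        # feat:  (B, feat_ch, H, W) features from the encoder-decoder
        # t:     scalar time step in [0, 1]
        b, _, h, w = frame.shape
        t_map = torch.full((b, 1, h, w), float(t), device=frame.device)
        params = self.head(torch.cat([feat, t_map], dim=1))
        dx, dy, raw_w = params.chunk(3, dim=1)        # each (B, k, H, W)
        weight = torch.softmax(raw_w, dim=1)          # normalized kernel weights
        # Base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=frame.device),
            torch.linspace(-1, 1, w, device=frame.device),
            indexing="ij",
        )
        out = torch.zeros_like(frame)
        for i in range(self.k):
            # Offsets are predicted in pixels; convert to normalized units
            gx = xs + dx[:, i] * 2.0 / max(w - 1, 1)
            gy = ys + dy[:, i] * 2.0 / max(h - 1, 1)
            grid = torch.stack([gx, gy], dim=-1)      # (B, H, W, 2)
            sampled = F.grid_sample(frame, grid, align_corners=True)
            out = out + weight[:, i:i + 1] * sampled
        return out

# Example: warp one frame toward the temporal midpoint (t = 0.5)
if __name__ == "__main__":
    layer = DeformableKernelSynthesis(feat_ch=64, k=5)
    frame = torch.rand(1, 3, 64, 64)
    feat = torch.rand(1, 64, 64, 64)
    mid = layer(frame, feat, t=0.5)
    print(mid.shape)  # torch.Size([1, 3, 64, 64])

In the pipeline described by the abstract, such a layer would presumably be applied at multiple scales and to both input frames, with the warped results blended into the final intermediate frame; varying t between 0 and 1 then selects the temporal position of the synthesized frame, which is what allows interpolation of multiple frames at arbitrary time positions.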
