
Computer Engineering, 2025, Vol. 51, Issue (10): 346-356. doi: 10.19678/j.issn.1000-3428.0069375

• Graphics and Image Processing •

Multiple Video Frame Interpolation Method Based on Transformer and Enhanced Deformable Separable Convolution

SHI Changtong, SHAN Hongtao*

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2024-02-19  Revised: 2024-04-14  Online: 2025-10-15  Published: 2024-08-13
  • Contact: SHAN Hongtao

  • Funding: National Natural Science Foundation of China (62173222)

Abstract:

Existing multiple Video Frame Interpolation (VFI) methods rely on optical flow or Convolutional Neural Networks (CNNs), and the inherent limitations of both make it difficult for them to handle scenes with large motion effectively. To address this challenge, this study proposes a multiple VFI method based on Transformer and enhanced deformable separable convolution. The method combines shifted-window and cross-scale-window attention, enlarging the receptive field of the attention mechanism, and feeds the time step into the frame synthesis network as a key control variable, enabling interpolation at arbitrary time positions. Specifically, shallow features are first extracted by embedding layers, after which an encoder-decoder architecture extracts multi-scale deep features. Finally, a multi-scale, multi-frame synthesis network built on enhanced deformable separable convolutions takes the multi-scale features, the original video frames, and the time step as inputs and synthesizes intermediate frames at any time position. Experimental results demonstrate that the proposed method achieves high interpolation performance on several commonly used multiple VFI datasets. On the Vimeo90K septuplet dataset, it reaches a Peak Signal-to-Noise Ratio (PSNR) of 27.98 dB and a Structural Similarity (SSIM) of 0.912 for multi-frame interpolation, while its single-frame interpolation performance is also on par with mainstream methods. Visualization results show that, compared with other methods, the proposed method produces clearer and more plausible intermediate frames in scenes with large motion.

Key words: multiple Video Frame Interpolation (VFI), enhanced deformable separable convolution, arbitrary time position frame interpolation, cross-scale window-based attention, large motion
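
To make the time-step-conditioned frame-synthesis step described in the abstract concrete, the following is a minimal sketch, in PyTorch-style Python, of one way a deformable-kernel synthesis layer could predict per-pixel sampling offsets and weights conditioned on the time step t. All names here (DeformableKernelSynthesis, feat_ch, k, head) are illustrative assumptions for this sketch, not the authors' implementation, and the paper's "enhanced deformable separable convolution" will differ in its details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableKernelSynthesis(nn.Module):
    # Hypothetical sketch: predicts k per-pixel sampling offsets and
    # weights from decoder features concatenated with a broadcast
    # time-step map, then warps the input frame by weighted sampling
    # at the predicted offset locations.
    def __init__(self, feat_ch: int, k: int = 5):
        super().__init__()
        self.k = k
        # +1 input channel for the time-step map t in [0, 1]
        self.head = nn.Conv2d(feat_ch + 1, 3 * k, kernel_size=3, padding=1)

    def forward(self, frame, feat, t):
        # frame: (B, C, H, W) input video frame
        # feat:  (B, feat_ch, H, W) features from the encoder-decoder
        # t:     scalar time step in [0, 1]
        b, _, h, w = frame.shape
        t_map = torch.full((b, 1, h, w), float(t), device=frame.device)
        params = self.head(torch.cat([feat, t_map], dim=1))
        dx, dy, raw_w = params.chunk(3, dim=1)        # each (B, k, H, W)
        weight = torch.softmax(raw_w, dim=1)          # normalized kernel weights
        # Base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=frame.device),
            torch.linspace(-1, 1, w, device=frame.device),
            indexing="ij",
        )
        out = torch.zeros_like(frame)
        for i in range(self.k):
            # Offsets are predicted in pixels; convert to normalized units
            gx = xs + dx[:, i] * 2.0 / max(w - 1, 1)
            gy = ys + dy[:, i] * 2.0 / max(h - 1, 1)
            grid = torch.stack([gx, gy], dim=-1)      # (B, H, W, 2)
            sampled = F.grid_sample(frame, grid, align_corners=True)
            out = out + weight[:, i:i + 1] * sampled
        return out

# Example: warp one frame toward the temporal midpoint (t = 0.5)
if __name__ == "__main__":
    layer = DeformableKernelSynthesis(feat_ch=64, k=5)
    frame = torch.rand(1, 3, 64, 64)
    feat = torch.rand(1, 64, 64, 64)
    mid = layer(frame, feat, t=0.5)
    print(mid.shape)  # torch.Size([1, 3, 64, 64])

In the pipeline described by the abstract, such a layer would presumably be applied at multiple scales and to both input frames, with the warped results blended into the final intermediate frame; varying t between 0 and 1 then selects the temporal position of the synthesized frame, which is what allows interpolation of multiple frames at arbitrary time positions.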
