DiTVR: Zero-Shot Diffusion Transformer for Video Restoration

Sicheng Gao  |  Nancy Mehta  |  Zongwei Wu  |  Radu Timofte
Computer Vision Lab, CAIDAS & IFI, University of Würzburg

DiTVR handles multiple video restoration tasks, including super-resolution, denoising, deblurring, and colorization.

Abstract

Video restoration aims to reconstruct high-quality video sequences from low-quality inputs, addressing tasks such as super-resolution, denoising, and deblurring. Traditional regression-based methods often produce unrealistic details and require extensive paired datasets, while generative diffusion models struggle to maintain temporal consistency. To overcome these limitations, we introduce a zero-shot video restoration framework that leverages a pre-trained Diffusion Transformer operating directly in pixel space. Unlike prior methods that use optical flow solely for structural guidance, our framework integrates a spatiotemporal neighbor cache, trajectory-aware attention, and a flow-guided diffusion sampler, all driven by optical flow trajectories extracted from the low-quality inputs. Together, these components enforce precise spatial alignment and robust temporal coherence across frames, while the flow-guided sampler refines detail reconstruction during inference, achieving realistic restoration without sacrificing temporal consistency.
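To make the flow-driven alignment concrete, the sketch below shows the generic idea behind a spatiotemporal neighbor cache: neighbor frames are backward-warped to the center frame along dense optical flow before being stacked for attention. This is a minimal NumPy illustration, not the paper's implementation; the function names are hypothetical, and a real pipeline would compute flow with an off-the-shelf network (e.g. RAFT) and sample on the GPU.

```python
import numpy as np

def warp_by_flow(frame, flow):
    """Backward-warp a frame (H, W, C) along a dense flow field (H, W, 2)
    with bilinear interpolation. Illustrative only (hypothetical helper)."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Each output pixel samples from its flow-displaced source location.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Bilinear blend of the four nearest source pixels.
    return ((1 - wy) * ((1 - wx) * frame[y0, x0] + wx * frame[y0, x1])
            + wy * ((1 - wx) * frame[y1, x0] + wx * frame[y1, x1]))

def build_neighbor_cache(neighbor_frames, flows_to_center):
    """Align each neighbor frame to the center frame, forming a stack
    (T, H, W, C) that trajectory-aware attention could query."""
    return np.stack([warp_by_flow(f, fl)
                     for f, fl in zip(neighbor_frames, flows_to_center)])
```

With zero flow the warp is an identity, and nonzero flow shifts content so that corresponding pixels line up across the cached frames, which is what lets attention along trajectories stay temporally coherent.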

Video Super-resolution Comparison on DAVIS

Video Denoising Comparison on DAVIS

Video Super-resolution Comparison on SPMC

Video Denoising Comparison on SPMC