WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

 


Our Results

We present sample results of our method.
Input "Will Smith is speaking" "Stephen Curry is speaking" "Justin Bieber is speaking"

Input "A leopard" "A pink cat" "A tiger"

Input "American comic style" "Makoto Shinkai Style" "Sunset ambiance"

Input "A man wearing yellow trousers" "Comic Book Illustration" "Makoto Shinkai Style"

Input "A marble statue is dancing." "James bond is dancing" "Makoto Shinkai Style"

Input "A cheetah" "A husky" "A white wolf, in the ice and snow"

Input "A marble statue of a woman is running" "Adele is running, bieber hair" "Bronze statue of a woman"


Comparisons to Baselines

Existing text-guided video editing methods suffer from temporal inconsistency.

Our method preserves the structure of the guidance image while following the target text.

"Classic Sepia-Toned Photograph" Ours Tune-A-Video ([1]) FateZero ([2]) FateZero+EbSynth ([2]) TokenFlow ([3]) TokenFlow+EbSynth ([3])

"Oil Painting of a Camel in the Desert" Ours Tune-A-Video ([1]) FateZero ([2]) FateZero+EbSynth ([2]) TokenFlow ([3]) TokenFlow+EbSynth ([3])

"Storybook Illustration" Ours Tune-A-Video ([1]) FateZero ([2]) FateZero+EbSynth ([2]) TokenFlow ([3]) TokenFlow+EbSynth ([3])

"Serene Watercolor Habitat" Ours Tune-A-Video ([1]) FateZero ([2]) FateZero+EbSynth ([2]) TokenFlow ([3]) TokenFlow+EbSynth ([3])

"Sunset ambiance" Ours Tune-A-Video ([1]) FateZero ([2]) FateZero+EbSynth ([2]) TokenFlow ([3]) TokenFlow+EbSynth ([3])

Comparison to DDIM Inversion

We compare the reconstruction results of naive DDIM inversion and RS (Randomly Shuffled) inversion, along with the edited outputs that each inversion method produces. RS inversion reconstructs the original video keyframes more faithfully than naive DDIM inversion, which in turn leads to better edited outputs.
Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result


Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result

Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result

Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result

Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result

Input DDIM inversion keyframe reconstruction DDIM inversion edited keyframes DDIM inversion edited result
Input RS inversion keyframe reconstruction RS inversion edited keyframes RS inversion edited result
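
As a rough illustration of what the naive DDIM baseline in this comparison does, the sketch below inverts a sample to a noisy latent with deterministic DDIM steps and then samples it back, so the reconstruction error can be measured. It is a minimal NumPy sketch under stated assumptions, not the implementation used for these results: `toy_eps` is a hypothetical stand-in for a trained noise predictor, the linear beta schedule in `make_alphas` is an assumed default, and the 1-D array stands in for a keyframe latent. It does not implement RS inversion, whose details are not given on this page.

```python
import numpy as np

def make_alphas(num_steps=50):
    """Cumulative signal levels (alpha-bar), decreasing as noise is added."""
    betas = np.linspace(1e-4, 2e-2, num_steps)  # assumed linear schedule
    return np.cumprod(1.0 - betas)

def toy_eps(x, t):
    """Hypothetical stand-in for a trained noise predictor eps_theta(x, t)."""
    return np.tanh(x) * (0.5 + 0.01 * t)

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two signal levels (eta = 0)."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(x0, alphas, eps_fn):
    """Clean sample -> latent. The noise is predicted at the less noisy
    state, which is the approximation that makes naive inversion inexact."""
    x, a_prev = np.asarray(x0, dtype=float), 1.0
    for t, a_t in enumerate(alphas):
        x = ddim_step(x, eps_fn(x, t), a_prev, a_t)
        a_prev = a_t
    return x

def ddim_sample(x_T, alphas, eps_fn):
    """Latent -> clean sample, walking the same schedule backwards."""
    x = np.asarray(x_T, dtype=float)
    for t in range(len(alphas) - 1, -1, -1):
        a_to = alphas[t - 1] if t > 0 else 1.0
        x = ddim_step(x, eps_fn(x, t), alphas[t], a_to)
    return x

if __name__ == "__main__":
    alphas = make_alphas()
    frame = np.linspace(-1.0, 1.0, 16)  # stand-in for one keyframe latent
    latent = ddim_invert(frame, alphas, toy_eps)
    recon = ddim_sample(latent, alphas, toy_eps)
    print("mean reconstruction error:", np.abs(recon - frame).mean())
```

In this sketch the round trip is exact only when the predicted noise does not depend on the current state. With a state-dependent predictor, inversion and sampling evaluate it at different points of the trajectory, and the resulting mismatch shows up as reconstruction error.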



 

References

[1] Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Qi, Chenyang, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. "Fatezero: Fusing attentions for zero-shot text-based video editing." arXiv preprint arXiv:2303.09535 (2023).

[3] Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).