WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing

Our Results

We present sample results of our method.

Input	"Will Smith is speaking"	"Stephen Curry is speaking"	"Justin Bieber is speaking"

Input	"A leopard"	"A pink cat"	"A tiger"

Input	"American comic style"	"Makoto Shinkai Style"	"Sunset ambiance"

Input	"A man wearing yellow trousers"	"Comic Book Illustration"	"Makoto Shinkai Style"

Input	"A marble statue is dancing."	"James bond is dancing"	"Makoto Shinkai Style"

Input	"A cheetah"	"A husky"	"A white wolf, in the ice and snow"

Input	"A marble statue of a woman is running"	"Adele is running, bieber hair"	"Bronze statue of a woman"

Comparisons to Baselines

Existing methods of text-guided video editing suffer from temporal inconsistency.

Tune-a-video ([1])
Fate-Zero ([2])
TokenFlow ([3])

Our method preserves the structure of the guidance image while fulfilling the target text.

"Classic Sepia-Toned Photograph"	Ours	Tune-a-video ([1])	Fate-Zero ([2])	Fate-Zero+EbSynth ([2])	TokenFlow ([3])	TokenFlow+EbSynth ([3])

"Oil Painting of a Camel in the Desert"	Ours	Tune-a-video ([1])	Fate-Zero ([2])	Fate-Zero+EbSynth ([2])	TokenFlow ([3])	TokenFlow+EbSynth ([3])

"Storybook Illustration"	Ours	Tune-a-video ([1])	Fate-Zero ([2])	Fate-Zero+EbSynth ([2])	TokenFlow ([3])	TokenFlow+EbSynth ([3])

"Serene Watercolor Habitat"	Ours	Tune-a-video ([1])	Fate-Zero ([2])	Fate-Zero+EbSynth ([2])	TokenFlow ([3])	TokenFlow+EbSynth ([3])

"Sunset ambiance"	Ours	Tune-a-video ([1])	Fate-Zero ([2])	Fate-Zero+EbSynth ([2])	TokenFlow ([3])	TokenFlow+EbSynth ([3])

Comparison to DDIM Inversion

We compare the reconstruction results of naive DDIM inversion and RS (Randomly Shuffled) inversion, together with the edited outputs obtained from each inversion method. RS inversion reconstructs the original video keyframes more faithfully than naive DDIM inversion, which in turn leads to better edited outputs.
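
For readers unfamiliar with the naive baseline, the following is a minimal NumPy sketch of deterministic DDIM inversion followed by reconstruction, with the keyframe reconstruction error measured at the end. It is an illustration under stated assumptions, not the implementation used for the results on this page: `toy_eps` is a hypothetical stand-in for a trained noise-prediction network, and the noise schedule, step count, and array shapes are arbitrary choices. RS inversion is not sketched, since its procedure is not described here.

```python
import numpy as np


def make_alpha_bars(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def toy_eps(x, t):
    """Hypothetical noise predictor; a real pipeline would call a trained network here."""
    return 0.1 * np.tanh(x)


def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update from noise level a_from to noise level a_to."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(x0, alpha_bars, eps_fn):
    """Naive inversion: walk from clean frames (level 0) towards noise (level T)."""
    a = np.concatenate([[1.0], alpha_bars])  # a[0] = 1.0 means "no noise"
    x = x0.copy()
    for t in range(len(alpha_bars)):
        # Approximation: noise is predicted at the current level t, not at t + 1.
        x = ddim_step(x, eps_fn(x, t), a[t], a[t + 1])
    return x


def ddim_reconstruct(x_T, alpha_bars, eps_fn):
    """Run the deterministic sampler back from level T to level 0."""
    a = np.concatenate([[1.0], alpha_bars])
    x = x_T.copy()
    for t in reversed(range(len(alpha_bars))):
        x = ddim_step(x, eps_fn(x, t + 1), a[t + 1], a[t])
    return x


def reconstruction_error(x0, x_rec):
    """Mean absolute difference between original and reconstructed keyframes."""
    return float(np.mean(np.abs(x0 - x_rec)))


# Stand-in "keyframes": 4 frames of 8x8 latents with 3 channels (assumed shapes).
rng = np.random.default_rng(0)
keyframes = rng.standard_normal((4, 8, 8, 3))
alpha_bars = make_alpha_bars()

inverted = ddim_invert(keyframes, alpha_bars, toy_eps)
reconstructed = ddim_reconstruct(inverted, alpha_bars, toy_eps)
print("reconstruction error:", reconstruction_error(keyframes, reconstructed))
```

The round trip is not exact because each inversion step predicts noise at the current level rather than at the next one. That mismatch is the source of reconstruction error in naive DDIM inversion, and it typically grows when the noise predictor is conditioned on a text prompt.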

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

Input	DDIM inversion keyframe reconstruction	DDIM inversion edited keyframes	DDIM inversion edited result

Input	RS inversion keyframe reconstruction	RS inversion edited keyframes	RS inversion edited result

References

[1] Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[2] Qi, Chenyang, et al. "Fatezero: Fusing attentions for zero-shot text-based video editing." arXiv preprint arXiv:2303.09535 (2023).

[3] Geyer, Michal, et al. "Tokenflow: Consistent diffusion features for consistent video editing." arXiv preprint arXiv:2307.10373 (2023).