UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

1Tsinghua University,  2Huazhong University of Science and Technology,
3Kling Team, Kuaishou Technology
Preprint

TL;DR: We propose UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions (text, images, and videos), enabling controllable 4K video generation for the first time.


1080P Text-to-video Generation (Input vs Output)

4K Text-to-video Generation (Input vs Output)

1080P Multi-ID Image-guided Text-to-video Generation (Input vs Output)

4K Multi-ID Image-guided Text-to-video Generation (Input vs Output)

1080P Text-guided Video Editing (Input vs Output)

4K Text-guided Video Editing (Input vs Output)

Abstract

Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K videos—a feat previously unattainable with existing techniques.

Method

Overview of UniMMVSR in the context of a cascaded generation framework. Upsampler denotes the sequential operations of VAE decoding, upscaling via bilinear interpolation, and VAE re-encoding. TC and CC denote token concatenation and channel concatenation, respectively. Text prompts are encoded by the text encoder and injected via cross-attention layers.
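The upsampler and the two concatenation paths can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the toy VAE, the 8x latent compression factor, the tensor shapes, and the token layouts are all assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def upsample_latent(latent, vae_decode, vae_encode, scale=4):
    """Upsampler: VAE decode -> bilinear upscale -> VAE re-encode.

    `vae_decode` / `vae_encode` are stand-ins for the real VAE
    (hypothetical signatures); `latent` is (B, C, H, W).
    """
    pixels = vae_decode(latent)                       # back to pixel space
    up = F.interpolate(pixels, scale_factor=scale,
                       mode="bilinear", align_corners=False)
    return vae_encode(up)                             # HR latent

# Toy VAE stand-ins with an assumed 8x spatial compression.
def toy_decode(z):
    return F.interpolate(z, scale_factor=8, mode="nearest")

def toy_encode(x):
    return F.avg_pool2d(x, kernel_size=8)

z_lr = torch.randn(1, 4, 8, 8)                        # low-res latent
z_hr = upsample_latent(z_lr, toy_decode, toy_encode)  # (1, 4, 32, 32)

# CC: channel concatenation for spatially aligned video conditions.
cond_video = torch.randn(1, 4, 32, 32)
cc_input = torch.cat([z_hr, cond_video], dim=1)       # (1, 8, 32, 32)

# TC: token concatenation for reference-image conditions.
tokens = z_hr.flatten(2).transpose(1, 2)              # (1, 1024, 4)
ref_tokens = torch.randn(1, 256, 4)                   # reference tokens
tc_input = torch.cat([tokens, ref_tokens], dim=1)     # (1, 1280, 4)
```

Channel concatenation suits conditions that align pixel-for-pixel with the target (e.g., the low-res video), while token concatenation lets the model attend to conditions with no spatial correspondence (e.g., identity reference images).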

Degradation Pipeline

We first apply SDEdit-based degradation, using our text-to-video base model to perturb the local structure of the HR video. We then apply traditional synthetic degradation to introduce high-frequency degradation patterns.
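The second, traditional stage can be sketched as a classic blur-downsample-noise chain on a single channel. This is an illustrative NumPy sketch only; the kernel size, sigma, noise level, and scale factor are assumed values, and the SDEdit stage is omitted since it requires the diffusion base model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(size=5, sigma=1.2):
    """Normalized 2D Gaussian blur kernel."""
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Naive 2D convolution with reflect padding."""
    p = kernel.shape[0] // 2
    padded = np.pad(img, p, mode="reflect")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = (win * kernel).sum()
    return out

def degrade(hr, scale=4, noise_std=0.02):
    """Traditional synthetic degradation: blur -> downsample -> noise."""
    blurred = blur(hr, gaussian_kernel())
    lr = blurred[::scale, ::scale]                 # strided downsampling
    return np.clip(lr + rng.normal(0, noise_std, lr.shape), 0.0, 1.0)

hr = rng.random((64, 64))                          # toy HR frame in [0, 1]
lr = degrade(hr)                                   # (16, 16) degraded frame
```

Real degradation pipelines (e.g., Real-ESRGAN style) typically also randomize kernel shapes, resampling modes, and compression artifacts; the sketch keeps only the core blur/downsample/noise structure.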