Enabling Versatile Controls for Video Diffusion Models

Xu Zhang1, Hao Zhou1, Haoming Qin1,2, Xiaobin Lu1,3, Jiaxing Yan1, Guanzhong Wang1, Zeyu Chen1, Yi Liu1
1PaddlePaddle Team, Baidu Inc. 2Xiamen University 3Sun Yat-sen University


Abstract

Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address this challenge, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pre-trained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality.

How does VCtrl work?

  • We propose a Unified Control Signal Encoding pipeline. It transforms diverse video-based control signals into a unified latent representation, incorporating adaptive masks to enhance cross-condition adaptability.
  • We introduce VCtrl, a lightweight auxiliary control module. This Transformer-based module efficiently encodes abstract conditioning signals into residual representations, enabling precise and fine-grained control over the base model's internal representations.
  • We propose a Sparse Residual Connection Mechanism. This approach integrates residual representations from VCtrl into pre-trained models, effectively balancing control accuracy and computational efficiency (a minimal code sketch of these three components follows below).
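
The sketch below illustrates how these three components could fit together. It is a minimal, hypothetical PyTorch sketch: all class names, parameter names, and hyperparameters (ControlEncoder, VCtrlModule, sparse_residual_inject, d_model, inject_indices, and so on) are illustrative assumptions, not the released PP-VCtrl implementation, which may differ (e.g., it may be built on PaddlePaddle).

```python
# Minimal, hypothetical sketch of the three components above (PyTorch).
# All names and hyperparameters are illustrative assumptions, not the
# released PP-VCtrl code.
import torch
import torch.nn as nn


class ControlEncoder(nn.Module):
    """Unified control-signal encoding: projects a per-frame control latent
    (from Canny edges, segmentation masks, or rendered keypoints) together
    with an adaptive validity mask into tokens of the base model's width."""

    def __init__(self, in_channels: int = 4, d_model: int = 512, patch: int = 2):
        super().__init__()
        # +1 channel for the adaptive mask concatenated to the control latent
        self.proj = nn.Conv3d(in_channels + 1, d_model, kernel_size=patch, stride=patch)

    def forward(self, control_latent: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([control_latent, mask], dim=1)   # (B, C+1, T, H, W)
        x = self.proj(x)                               # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)            # (B, N, D) control tokens


class VCtrlModule(nn.Module):
    """Lightweight Transformer that maps control tokens to residual features."""

    def __init__(self, d_model: int = 512, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # Zero-initialized output projection so training starts from the
        # unmodified, frozen base model.
        self.to_residual = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.to_residual.weight)
        nn.init.zeros_(self.to_residual.bias)

    def forward(self, control_tokens: torch.Tensor) -> torch.Tensor:
        return self.to_residual(self.blocks(control_tokens))


def sparse_residual_inject(base_blocks, hidden, residual, inject_indices):
    """Sparse residual connection: add the control residual only after a small
    subset of the frozen base model's blocks (e.g., every k-th block)."""
    for i, block in enumerate(base_blocks):
        hidden = block(hidden)
        if i in inject_indices:
            hidden = hidden + residual
    return hidden
```

Injecting the residual at only a few blocks, rather than every layer, is what keeps the auxiliary module lightweight while still steering the frozen generator.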

PP-VCtrl Video-Image Finetuning

Architecture overview of VCtrl. A control signal (e.g., Canny edges, semantic masks, or pose keypoints) is first encoded by the control encoder. The resulting representation is then additively combined with the video latent and incorporated into the Video Diffusion Model via the proposed VCtrl module, which leverages a sparse residual connection mechanism. After several iterative denoising steps, the refined latent is decoded by a pretrained VAE to produce the final video.
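
As a concrete illustration of this pipeline, the hedged sketch below walks through one sampling pass. The component objects passed in (control_encoder, vctrl, base_model, vae) and the diffusers-style scheduler interface are assumptions made for readability, not the actual PP-VCtrl API.

```python
# Hypothetical end-to-end sampling loop matching the description above.
# The callables (control_encoder, vctrl, base_model, vae, scheduler) are
# stand-ins; their exact interfaces are assumptions, with the scheduler
# following a diffusers-style set_timesteps()/step() API.
import torch


@torch.no_grad()
def generate_video(control_latent, mask, text_emb,
                   control_encoder, vctrl, base_model, vae, scheduler,
                   num_steps: int = 50):
    # 1. Encode the control signal (Canny / mask / pose) into residual features.
    control_tokens = control_encoder(control_latent, mask)
    residual = vctrl(control_tokens)

    # 2. Start from Gaussian noise in the base model's latent space.
    latents = torch.randn_like(control_latent)

    # 3. Iterative denoising; the control residual is added to the latent
    #    features inside base_model through the sparse residual connections.
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        noise_pred = base_model(latents, t, text_emb, control_residual=residual)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode the refined latent into video frames with the pretrained VAE.
    return vae.decode(latents)
```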

Comparisons with Other Methods

Canny-to-Video

Mask-to-Video

Pose-to-Video

Quantitative Results

We present a comprehensive quantitative evaluation of our method against representative existing approaches across three video generation tasks. For each task, we select suitable benchmarks and evaluate with both established and newly proposed metrics to ensure a thorough comparison. The quantitative results are shown below.


Diversity of Generated Video

Our method demonstrates strong generalizability: given the same control signals, it generates diverse video outputs by altering the initial reference frame. This capability underscores the flexibility and scalability of the proposed framework. The generated videos are shown below.
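
A small usage sketch of this procedure is given below, assuming a generic pipeline callable that accepts a control video and a reference image; the callable and its keyword arguments are assumptions, not the released PP-VCtrl API.

```python
# Hypothetical usage sketch: keep the control video fixed and vary only the
# reference (first) frame; each call then produces a different video under the
# same control. `pipeline` and its keyword arguments are assumptions.
from typing import Callable, Sequence


def sample_diverse(pipeline: Callable, control_video, reference_frames: Sequence):
    # The control signal is reused unchanged; only the reference image differs,
    # so the outputs share structure and motion but vary in appearance.
    return [
        pipeline(control=control_video, reference_image=ref)
        for ref in reference_frames
    ]
```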

BibTeX

@article{zhang2025enablingversatilecontrolsvideo,
  title={Enabling Versatile Controls for Video Diffusion Models},
  author={Xu Zhang and Hao Zhou and Haoming Qin and Xiaobin Lu and Jiaxing Yan and Guanzhong Wang and Zeyu Chen and Yi Liu},
  journal={arXiv preprint arXiv:2503.16983},
  year={2025},
}