Step Video T2V: Revolutionizing Text-to-Video Generation
Introduction
Step Video T2V is a cutting-edge text-to-video model that transforms textual descriptions into high-quality video content. With 30 billion parameters, it can generate videos of up to 204 frames, making it a powerful tool for content creators and developers. The model employs a deep-compression Variational Autoencoder (VAE) to improve both training and inference efficiency, achieving significant spatial and temporal compression, and applies Direct Preference Optimization (DPO) so that the generated videos meet high visual standards.
Model Summary
Step Video T2V utilizes a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. It encodes user prompts using bilingual pre-trained text encoders, supporting both English and Chinese. The model's architecture includes a DiT with 3D full attention, trained using Flow Matching to denoise input noise into latent frames. Text embeddings and timesteps serve as conditioning factors, enhancing the visual quality of the generated videos through a video-based DPO approach.
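As a rough illustration of the Flow Matching sampling described above, the sketch below runs Euler integration of a learned velocity field from pure noise toward data with a toy network. The real model replaces the toy network with the text-conditioned 48-layer DiT operating on VAE latent frames; the time direction, step schedule, and network shape here are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for the DiT: the real model is a 48-layer transformer over
# latent video tokens, conditioned on text embeddings and the timestep.
class ToyVelocityNet(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        # Append the timestep as an extra feature (the real model uses AdaLN-Single).
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def flow_matching_sample(model, shape, num_steps: int = 50):
    """Euler integration of the velocity field from noise (t=1) to data (t=0)."""
    x = torch.randn(shape)                      # start from Gaussian noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        v = model(x, ts[i].view(1, 1))          # predicted velocity at time ts[i]
        x = x + (ts[i + 1] - ts[i]) * v         # one Euler step toward the data end
    return x

model = ToyVelocityNet(dim=64)
samples = flow_matching_sample(model, shape=(2, 64))
print(samples.shape)  # torch.Size([2, 64])
```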
Video-VAE
The Video-VAE is designed for video generation tasks, achieving high compression while maintaining exceptional video reconstruction quality. This compression accelerates training and inference, aligning with the diffusion process's preference for condensed representations.
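A quick back-of-the-envelope calculation shows what the 16x16 spatial and 8x temporal compression means for latent sizes; the clip dimensions below are illustrative placeholders, and exact rounding and padding depend on the VAE implementation.

```python
# Illustrative latent sizes under 16x16 spatial and 8x temporal compression.
frames, height, width = 204, 544, 992   # example clip size (placeholder)

latent_frames = frames // 8     # 8x temporal compression
latent_height = height // 16    # 16x spatial compression
latent_width  = width // 16

print(f"pixel video : {frames} x {height} x {width}")
print(f"latent video: {latent_frames} x {latent_height} x {latent_width}")
# -> 204 x 544 x 992 pixels become roughly 25 x 34 x 62 latent positions.
```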
DiT with 3D Full Attention
Built on the DiT architecture, Step Video T2V features 48 layers with 48 attention heads per layer. AdaLN-Single incorporates the timestep condition, while QK-Norm in the self-attention mechanism ensures training stability. 3D RoPE is employed to handle sequences of varying video lengths and resolutions.
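The snippet below is a minimal sketch of a self-attention block with QK-Norm, the stabilization trick mentioned above. The choice of LayerNorm as the normalization and the omission of 3D RoPE and AdaLN-Single conditioning are simplifications for illustration, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormSelfAttention(nn.Module):
    """Self-attention with normalized queries and keys (QK-Norm)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.LayerNorm(self.head_dim)   # normalization choice is an assumption
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (batch, tokens, dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # Normalizing q and k bounds the attention logits, which helps keep
        # training stable at large scale.
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

attn = QKNormSelfAttention(dim=128, num_heads=8)
print(attn(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 128])
```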
Video-DPO
Human feedback is incorporated through Direct Preference Optimization (DPO) to enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring the generated content aligns with human expectations.
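The core pairwise objective behind DPO is sketched below. For a video diffusion/flow model the per-sample log-probabilities are typically replaced by (negative) denoising losses, as in Diffusion-DPO, so the exact form here is an assumption rather than the project's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta: float = 0.1):
    """Pairwise preference loss: raise the likelihood of the human-preferred
    sample relative to the rejected one, measured against a frozen reference
    model so the policy does not drift arbitrarily far from it."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with random numbers standing in for per-sample log-probabilities.
scores = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*scores))
```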
Model Download
The Step Video T2V model is available for download on Hugging Face and ModelScope. Both the standard and Turbo versions are provided; the Turbo variant uses Inference Step Distillation for faster generation.
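A minimal download sketch using the huggingface_hub client is shown below. The repository ID is an assumption, so check the model card for the exact name; an analogous download is available through ModelScope.

```python
from huggingface_hub import snapshot_download

# Fetch the model weights into a local directory; the repo_id is assumed,
# not confirmed by this document -- verify it on the model card.
local_dir = snapshot_download(
    repo_id="stepfun-ai/stepvideo-t2v",
    local_dir="./stepvideo-t2v",
)
print("Model downloaded to", local_dir)
```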
Model Usage
Requirements
Running the Step Video T2V model requires an NVIDIA GPU with CUDA support. The model has been tested on four GPUs, and GPUs with 80 GB of memory are recommended for the best generation quality. The tested operating system is Linux, and the text encoder requires a GPU with a supported CUDA compute capability.
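A quick way to verify that the local hardware meets these recommendations is to query the available devices with PyTorch, as in this sketch:

```python
import torch

# Sanity-check the local GPUs against the recommendation above.
assert torch.cuda.is_available(), "An NVIDIA GPU with CUDA support is required."
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    mem_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB, "
          f"compute capability {props.major}.{props.minor}")
    if mem_gb < 80:
        print("  warning: below the recommended 80 GB of GPU memory")
```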
Dependencies and Installation
The model requires Python 3.10 or higher, PyTorch 2.3 (CUDA 12.1 build), the CUDA Toolkit, and FFmpeg. Installation involves cloning the repository, creating a conda environment, and installing the required packages.
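After installation, a lightweight check along these lines can confirm that the interpreter, PyTorch build, and FFmpeg binary are in place (a sketch, not part of the official setup):

```python
import shutil
import sys

import torch

# Confirm the basics listed above are in place.
assert sys.version_info >= (3, 10), "Python 3.10 or higher is required."
print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)           # expected: a 2.3.x CUDA 12.1 build
print("CUDA   :", torch.version.cuda)
print("FFmpeg :", shutil.which("ffmpeg") or "not found on PATH")
```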
Inference Scripts
For multi-GPU parallel deployment, a decoupling strategy optimizes GPU resource utilization: a dedicated GPU serves an API for the text encoder's embeddings and for VAE decoding, while the remaining GPUs run the DiT in parallel. Single-GPU inference and quantization are supported by the open-source project DiffSynth-Studio.
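Conceptually, the decoupling means the denoising GPUs talk to the dedicated GPU over a small API rather than loading the text encoder and VAE themselves. The sketch below illustrates that client side; the endpoint URLs, routes, and payload formats are hypothetical placeholders, not the repository's actual service interface.

```python
import requests
import torch

# Hypothetical endpoints for the dedicated-GPU service; the real project's
# routes, ports, and payload formats will differ.
CAPTION_URL = "http://127.0.0.1:8080/caption"
VAE_URL = "http://127.0.0.1:8080/vae-decode"

def remote_text_embeddings(prompt: str) -> torch.Tensor:
    """Ask the dedicated GPU for text embeddings instead of loading the
    text encoder on every denoising GPU."""
    resp = requests.post(CAPTION_URL, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return torch.tensor(resp.json()["embeddings"])

def remote_vae_decode(latents: torch.Tensor) -> bytes:
    """Send denoised latents to the dedicated GPU for VAE decoding and get
    an encoded video back."""
    resp = requests.post(VAE_URL, json={"latents": latents.tolist()}, timeout=600)
    resp.raise_for_status()
    return resp.content
```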
Best-of-Practice Inference Settings
Step Video T2V consistently generates high-fidelity, dynamic videos, but output quality depends on the inference parameters, such as the number of denoising steps and the classifier-free guidance scale. Tuning these settings balances video fidelity against motion dynamics.
Benchmark
Step Video T2V Eval is a new benchmark featuring 128 Chinese prompts from real users. It evaluates video quality across 11 categories, including Sports, Food, Scenery, and more.
Online Engine
The online version of Step Video T2V is available on 跃问视频 (Yuewen Video), where you can browse impressive examples and explore the model's capabilities further.
Citation
For academic referencing, please use the provided BibTeX citation.
Acknowledgement
We express our gratitude to the xDiT team for their support and parallelization strategy. Our code will be integrated into the official Huggingface/Diffusers repository. We also thank the FastVideo team for their collaboration and look forward to launching inference acceleration solutions together.