Video-T1 by liuff19

Research paper for test-time scaling (TTS) in video generation

Created 9 months ago

303 stars

Top 88.4% on SourcePulse

Project Summary

Video-T1 applies test-time scaling (TTS) to improve video generation quality and prompt consistency. It targets researchers and practitioners in generative AI, offering a way to enhance existing video generation models without retraining.

How It Works

Video-T1 offers two search strategies: Random Linear Search and Tree of Frames (ToF) Search. Random Linear Search samples several Gaussian noises, generates a video clip from each through step-by-step denoising, and selects the output that the test verifiers score highest. ToF Search refines this by dividing generation into three stages: image-level alignment, which influences the later frames; dynamic prompt guidance in the verifiers, focusing on motion stability and physical plausibility; and a final assessment of overall video quality against the text prompt. This staged, guided search explores the generation space more efficiently, leading to higher-quality outputs.
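The Random Linear Search described above is a best-of-N selection loop, and a toy sketch may make that concrete. Everything here is an illustrative assumption rather than the project's code: toy_denoise stands in for a real video generator, toy_verifier stands in for a learned verifier, and a "video" is just a list of floats.

```python
import random
from typing import Callable, List, Optional, Tuple

Video = List[float]  # toy stand-in for a generated clip


def sample_noise(rng: random.Random, dim: int) -> List[float]:
    """Draw one Gaussian noise vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_denoise(noise: List[float], steps: int) -> Video:
    """Hypothetical generator: shrink the noise a little at every step."""
    x = list(noise)
    for _ in range(steps):
        x = [0.9 * v for v in x]
    return x


def toy_verifier(video: Video) -> float:
    """Hypothetical verifier: higher is better (closer to an all-zero target)."""
    return -sum(v * v for v in video)


def random_linear_search(
    num_candidates: int,
    dim: int,
    steps: int,
    seed: int = 0,
    generate: Callable[[List[float], int], Video] = toy_denoise,
    verify: Callable[[Video], float] = toy_verifier,
) -> Tuple[Video, float]:
    """Generate num_candidates clips independently and keep the best-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_video: Optional[Video] = None
    best_score = float("-inf")
    for _ in range(num_candidates):
        video = generate(sample_noise(rng, dim), steps)
        score = verify(video)
        if score > best_score:
            best_video, best_score = video, score
    assert best_video is not None
    return best_video, best_score


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, score = random_linear_search(num_candidates=n, dim=8, steps=10)
        print(n, score)
```

With a fixed seed, a larger candidate budget can only match or improve the best score in this toy setup, because the smaller run's candidates are a prefix of the larger run's. The cost grows linearly with the number of candidates, since every candidate is a fully generated clip.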

Quick Start & Requirements

Install: Clone the repository, create a conda environment (conda create -n videot1 python==3.10), activate it (conda activate videot1), and install dependencies (pip install -r requirements.txt). Additionally, clone and install LLaVA-NeXT (git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]").
Model Checkpoints: Requires downloading checkpoints for Pyramid-Flow, VisionReward-Video, and optionally Image-CoT-Generation and a large language model like DeepSeek-R1-Distill-Llama-8B.
Inference: Run via python videot1.py --prompt "..." --video_name .... Multi-GPU inference is supported via videot1_multigpu.py.
Resources: Requires significant GPU resources for model checkpoints and inference.

Highlighted Details

Demonstrates consistent performance improvements with increased test-time computation.
Supports both Random Linear Search and a more sophisticated Tree of Frames (ToF) Search.
Offers multi-GPU inference to manage memory constraints.
Allows fine-grained control over generation through parameters like num_inference_steps, video_branching_factors, and image_branching_factors.
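To illustrate how per-stage branching factors can shape a tree-style search, here is a minimal sketch of a staged expand-and-prune loop. It is not the project's implementation: the extend and stage_score functions, the keep parameter, and the reading of a branching factor as "children generated per surviving candidate at that stage" are all assumptions made for illustration, and the source does not specify how video_branching_factors or image_branching_factors are actually interpreted.

```python
import random
from typing import List, Sequence, Tuple

Clip = List[float]  # toy partial video: one float per "frame"


def extend(clip: Clip, rng: random.Random) -> Clip:
    """Hypothetical generator: append one frame near the previous frame."""
    prev = clip[-1] if clip else 0.0
    return clip + [prev + rng.gauss(0.0, 1.0)]


def stage_score(clip: Clip) -> float:
    """Hypothetical verifier: penalize a far-off first frame and large
    frame-to-frame jumps (a crude proxy for motion stability)."""
    jumps = sum(abs(b - a) for a, b in zip(clip, clip[1:]))
    return -(abs(clip[0]) + jumps)


def tree_of_frames_search(
    branching_factors: Sequence[int], keep: int = 2, seed: int = 0
) -> Tuple[Clip, float]:
    """Expand each surviving partial clip, score the children, keep the top few."""
    if keep < 1 or any(b < 1 for b in branching_factors):
        raise ValueError("keep and every branching factor must be at least 1")
    rng = random.Random(seed)
    frontier: List[Clip] = [[]]  # start from a single empty clip
    for b in branching_factors:
        children = [extend(clip, rng) for clip in frontier for _ in range(b)]
        children.sort(key=stage_score, reverse=True)
        frontier = children[:keep]  # prune weak branches before the next stage
    best = frontier[0]
    return best, stage_score(best)


if __name__ == "__main__":
    clip, score = tree_of_frames_search([4, 2, 2], keep=2)
    print(len(clip), score)
```

In this sketch, branching factors of [4, 2, 2] with keep=2 generate 4 + 4 + 4 = 12 single-frame extensions in total. Pruning after each stage is what keeps the cost from growing with the full product of the branching factors.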

Maintenance & Community

The project is associated with Tsinghua University. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research contribution (ICCV 2025) and may be in an early stage. Specific hardware requirements for optimal performance and detailed compatibility information are not fully elaborated. The need for multiple large model checkpoints implies a substantial resource footprint.
