Video-T1 by liuff19

Research paper for test-time scaling (TTS) in video generation

created 4 months ago
282 stars

Top 93.5% on sourcepulse

Project Summary

Video-T1 applies test-time scaling (TTS) to improve the quality and prompt consistency of generated video. It targets researchers and practitioners in generative AI, offering a way to get better results from existing video generation models without retraining them.

How It Works

Video-T1 offers two search strategies: Random Linear Search and Tree of Frames (ToF) Search. Random Linear Search samples several Gaussian noise candidates, denoises each one step by step into a full video clip, and selects the clip that the test verifiers score highest. ToF Search refines this by dividing generation into three stages: image-level alignment of the early frames, which influences the later frames; dynamic prompt guidance in the verifiers, focused on motion stability and physical plausibility; and a final assessment of overall video quality against the text prompt. This staged, guided search explores the generation space more efficiently than scoring only finished videos, leading to higher-quality outputs.
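
The Random Linear Search described above is essentially best-of-N selection, which can be sketched in a few lines. This is a minimal illustration, not the repository's code: `toy_generate` and `toy_verifier` are hypothetical stand-ins for the video generator and the test verifiers, and a "video" here is just a list of numbers.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # toy stand-in for a generated clip


def random_linear_search(
    generate: Callable[[random.Random], Video],
    verifier: Callable[[Video], float],
    num_candidates: int,
    seed: int = 0,
) -> Tuple[Video, float]:
    """Sample N independent candidates and keep the best-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_video: Video = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        video = generate(rng)    # draw fresh noise and fully "denoise" it
        score = verifier(video)  # the verifier only sees the finished clip
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score


def toy_generate(rng: random.Random) -> Video:
    # Hypothetical placeholder: a clip is 8 Gaussian "frame" values.
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


def toy_verifier(video: Video) -> float:
    # Hypothetical placeholder: prefer clips whose values stay near zero.
    return -sum(x * x for x in video)


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, score = random_linear_search(toy_generate, toy_verifier, n)
        print(n, round(score, 3))
```

With a fixed seed, a larger `num_candidates` can only match or improve the best score, which is the basic reason more test-time compute helps. The cost grows linearly with N because every candidate is denoised to completion before it is scored.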

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n videot1 python==3.10), activate it (conda activate videot1), and install dependencies (pip install -r requirements.txt). Additionally, clone and install LLaVA-NeXT (git clone https://github.com/LLaVA-VL/LLaVA-NeXT && cd LLaVA-NeXT && pip install --no-deps -e ".[train]").
  • Model Checkpoints: Requires downloading checkpoints for Pyramid-Flow, VisionReward-Video, and optionally Image-CoT-Generation and a large language model like DeepSeek-R1-Distill-Llama-8B.
  • Inference: Run via python -m videot1.py --prompt "..." --video_name .... Multi-GPU inference is supported via videot1_multigpu.py.
  • Resources: Requires significant GPU resources for model checkpoints and inference.
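
For running several prompts in a row, the inference command can be assembled programmatically. The sketch below only builds the argument list and prints it. The invocation form and the two flags (`--prompt`, `--video_name`) are copied from the Inference step as listed here and have not been verified against the repository. `slugify` and `build_command` are hypothetical helpers introduced for illustration.

```python
import re
import shlex
from typing import List


def slugify(prompt: str, max_len: int = 40) -> str:
    """Turn a prompt into a filesystem-friendly name (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return slug[:max_len].rstrip("_") or "video"


def build_command(prompt: str, video_name: str = "") -> List[str]:
    """Build the argument list for one inference run.

    Assumption: the command form mirrors the one quoted in the Inference
    step; only the two flags named there are used.
    """
    return [
        "python", "-m", "videot1.py",
        "--prompt", prompt,
        "--video_name", video_name or slugify(prompt),
    ]


if __name__ == "__main__":
    prompts = ["A cat surfing a wave", "Rain on a neon street"]
    for prompt in prompts:
        # Print a shell-safe command line; pass the list to subprocess.run
        # from inside the cloned repository to actually execute it.
        print(shlex.join(build_command(prompt)))
```

Keeping the command as a list rather than a single string avoids quoting problems when prompts contain spaces or punctuation.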

Highlighted Details

  • Demonstrates consistent performance improvements with increased test-time computation.
  • Supports both Random Linear Search and a more sophisticated Tree of Frames (ToF) Search.
  • Offers multi-GPU inference to manage memory constraints.
  • Allows fine-grained control over generation through parameters like num_inference_steps, video_branching_factors, and image_branching_factors.
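
To give a feel for how per-stage branching factors shape a tree search and its compute cost, here is a minimal sketch. It is a generic staged search with pruning, not the repository's ToF implementation: the exact semantics of `video_branching_factors` and `image_branching_factors` are not described here, so `branching_factors`, `beam_width`, `toy_extend`, and `toy_score` are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence

Partial = List[float]  # toy stand-in for a partially generated clip


def staged_tree_search(
    extend: Callable[[Partial, random.Random], Partial],
    score: Callable[[Partial], float],
    branching_factors: Sequence[int],
    beam_width: int = 2,
    seed: int = 0,
) -> Partial:
    """Expand each kept candidate per stage, then prune to the best few."""
    if beam_width < 1 or any(f < 1 for f in branching_factors):
        raise ValueError("beam_width and branching factors must be >= 1")
    rng = random.Random(seed)
    beam: List[Partial] = [[]]  # start from a single empty clip
    for factor in branching_factors:
        # Each surviving candidate spawns `factor` continuations.
        children = [extend(p, rng) for p in beam for _ in range(factor)]
        # Score partial results and keep only the most promising ones.
        children.sort(key=score, reverse=True)
        beam = children[:beam_width]
    return beam[0]


def count_expansions(branching_factors: Sequence[int], beam_width: int = 2) -> int:
    """Count how many candidate continuations the search generates."""
    total, beam = 0, 1
    for factor in branching_factors:
        children = beam * factor
        total += children
        beam = min(beam_width, children)
    return total


def toy_extend(partial: Partial, rng: random.Random) -> Partial:
    # Hypothetical placeholder: append one Gaussian "frame" value.
    return partial + [rng.gauss(0.0, 1.0)]


def toy_score(partial: Partial) -> float:
    # Hypothetical placeholder: prefer values near zero.
    return -sum(x * x for x in partial)


if __name__ == "__main__":
    factors = [4, 2, 2]
    best = staged_tree_search(toy_extend, toy_score, factors)
    print(len(best), count_expansions(factors))
```

Because weak candidates are pruned after every stage, the number of expansions grows with the sum of per-stage work rather than the product of all branching factors. That is the general trade-off a staged search makes against scoring only complete videos.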

Maintenance & Community

The project is associated with Tsinghua University. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a research contribution (ICCV 2025) and may be in an early stage. Specific hardware requirements for optimal performance and detailed compatibility information are not fully elaborated. The need for multiple large model checkpoints implies a substantial resource footprint.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 31 stars in the last 90 days
