VADER by mihirp1998

Fine-tuning video diffusion models via reward gradients (research paper code)

Created 1 year ago

303 stars

Top 88.4% on SourcePulse

Project Summary

This repository provides VADER (Video Diffusion Alignment via Reward Gradients), a method for fine-tuning video diffusion models to align with specific downstream tasks like aesthetic generation or text-video coherence. It targets researchers and developers working with foundational video diffusion models, offering an efficient alternative to supervised fine-tuning by leveraging pre-trained reward models.

How It Works

VADER uses dense gradients of pre-trained reward models (e.g., HPS, PickScore, YOLO) with respect to the generated pixels to update the video diffusion model. Because this signal has one component per pixel rather than being a single scalar score, learning stays efficient in the large search space of video generation. The model can then be aligned with objectives such as aesthetics, text-video similarity, and longer video generation without extensive curated datasets.
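The idea of following a reward's gradient with respect to generated pixels can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "generator" is a per-pixel scale applied to noise, the "reward" is a hand-written quadratic, and the names `generate`, `reward`, `reward_grad_wrt_pixels`, and `finetune` are hypothetical. It is not the repository's training loop, which backpropagates through real reward models and a video diffusion sampler.

```python
import random


def generate(theta, z):
    """Toy 'generator': each pixel is a learned scale times a noise value."""
    return [t * n for t, n in zip(theta, z)]


def reward(pixels, target=0.5):
    """Toy differentiable reward: higher when pixels are close to `target`."""
    return -sum((p - target) ** 2 for p in pixels) / len(pixels)


def reward_grad_wrt_pixels(pixels, target=0.5):
    """Dense gradient dR/dx: one value per generated pixel."""
    n = len(pixels)
    return [-2.0 * (p - target) / n for p in pixels]


def finetune(theta, steps=200, lr=0.5, seed=0):
    """Gradient ascent on the reward, pushed back through the generator."""
    rng = random.Random(seed)
    for _ in range(steps):
        z = [rng.uniform(0.5, 1.5) for _ in theta]  # freshly sampled noise
        x = generate(theta, z)
        dR_dx = reward_grad_wrt_pixels(x)
        # Chain rule: dR/dtheta_i = dR/dx_i * dx_i/dtheta_i, and dx_i/dtheta_i = z_i.
        theta = [t + lr * g * n for t, g, n in zip(theta, dR_dx, z)]
    return theta


theta_before = [0.0] * 4
theta_after = finetune(theta_before)
z_eval = [1.0] * 4  # fixed noise so the two rewards are comparable
reward_before = reward(generate(theta_before, z_eval))
reward_after = reward(generate(theta_after, z_eval))
print(reward_before, reward_after)
```

Each update uses a separate gradient value for every pixel, which is the property the summary refers to as dense gradient information. A gradient-free method would only see the scalar returned by `reward`.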

Quick Start & Requirements

Installation: Requires Conda environment setup per model (VideoCrafter, Open-Sora, ModelScope). PyTorch 2.3.0+ and CUDA 12.1 are recommended. xformers is also a dependency.
Prerequisites: The base models (VideoCrafter2, Open-Sora v1.2, ModelScope) must be downloaded manually or are fetched via Hugging Face. The HPSv2 library must be installed.
Hardware: VideoCrafter2 inference requires ~16GB VRAM. Open-Sora inference needs ~40GB VRAM at 360p resolution. Training Open-Sora on 360p, 2-second videos requires 48GB VRAM. ModelScope training can run with >14GB VRAM; 4x40GB A100s were used for experiments.
Links: Website, Demo, arXiv
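As a quick planning aid, the hardware figures above can be put into a small lookup. This is a hypothetical helper, not part of the repository: the dictionary keys and the function name `workflows_that_fit` are made up for illustration, and the numbers are the approximate values quoted above, treated as simple thresholds.

```python
# Approximate VRAM figures in GB, copied from the hardware notes above.
# They are rough guides ("~16", "~40", ">14"), not hard guarantees.
VRAM_REQUIREMENTS_GB = {
    "VideoCrafter2 inference": 16,
    "Open-Sora inference (360p)": 40,
    "Open-Sora training (360p, 2 s)": 48,
    "ModelScope training": 14,
}


def workflows_that_fit(available_gb):
    """Return the workflows whose quoted VRAM figure is within the given budget."""
    return sorted(
        name for name, need in VRAM_REQUIREMENTS_GB.items() if available_gb >= need
    )


for budget in (8, 24, 48):
    print(budget, workflows_that_fit(budget))
```

Actual memory use depends on batch size, precision, and resolution, so a workflow listed as fitting may still need tuning on a given GPU.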

Highlighted Details

Supports fine-tuning of VideoCrafter2, Open-Sora v1.2, and ModelScope text-to-video models.
Enables alignment for aesthetic quality, text-video similarity, and longer horizon video generation.
Demonstrates more efficient learning in terms of reward queries and compute compared to gradient-free methods.
Includes baseline implementations for DPO and DDPO.

Maintenance & Community

The project is associated with authors from institutions like CMU. Links to a website and Hugging Face demo are provided.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Because it builds upon other open-source projects, the licenses of those projects may constrain commercial use and should be checked.

Limitations & Caveats

Support for Stable Video Diffusion is listed as a planned feature but not yet implemented. The README notes potential issues with fp16 precision for certain Open-Sora configurations, recommending bf16 instead.
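The summary does not say why fp16 causes trouble in those configurations. One general difference between the two formats is dynamic range: fp16 overflows above roughly 65504, while bf16 keeps the exponent range of float32 at the cost of mantissa precision. The sketch below shows that difference using only the standard library. The helper names `to_fp16` and `to_bf16` are hypothetical, and bf16 is approximated by truncating a float32 rather than rounding it.

```python
import math
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision; return +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(float("inf"), x)


def to_bf16(x):
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1.0, 1000.0, 70000.0):
    print(value, to_fp16(value), to_bf16(value))
```

A value such as 70000.0 cannot be represented in fp16 at all, whereas bf16 stores a coarser but finite approximation. Whether this is the cause of the issue noted in the README is an assumption.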
