JavisVerse: Synchronized audio-video generation from text
Summary
JavisDiT addresses synchronized joint audio-video generation (JAVG) from text prompts with a novel Joint Audio-Video Diffusion Transformer built around Hierarchical Spatio-Temporal Prior Synchronization. It targets the key bottleneck of keeping generated audio and video coherent with each other and with the prompt, and pairs state-of-the-art generation with a new benchmark and evaluation metric to advance JAVG research.
How It Works
The core is a Diffusion Transformer that jointly synthesizes audio and video. Its novelty lies in the Hierarchical Spatio-Temporal Prior Synchronization mechanism, which keeps the generated audio and video temporally aligned and contextually consistent. The project complements the model with JavisBench, a large-scale benchmark dataset, and JavisScore, a robust synchronization metric, to standardize JAVG evaluation.
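To make the mechanism concrete, below is a minimal, purely illustrative PyTorch sketch (not the actual JavisDiT code; the module layout, names, and dimensions are assumptions) of the general pattern the description implies: video and audio token streams each run their own attention, and both cross-attend to the same text-derived prior tokens, which is where cross-modal synchronization would be imposed.

```python
# Purely illustrative sketch of shared-prior conditioning (NOT the JavisDiT code).
import torch
import torch.nn as nn

class SharedPriorBlock(nn.Module):
    """One joint block: per-modality self-attention, then cross-attention
    from each modality to a shared (e.g., text-derived) prior sequence."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_prior = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_prior = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens, prior_tokens):
        # Per-modality self-attention (the usual DiT-style denoising path).
        v, _ = self.video_self(video_tokens, video_tokens, video_tokens)
        a, _ = self.audio_self(audio_tokens, audio_tokens, audio_tokens)
        v = self.norm(v + video_tokens)
        a = self.norm(a + audio_tokens)
        # Both streams attend to the SAME prior tokens; conditioning on a
        # shared structure is what nudges audio and video to stay in sync.
        v2, _ = self.video_prior(v, prior_tokens, prior_tokens)
        a2, _ = self.audio_prior(a, prior_tokens, prior_tokens)
        return v + v2, a + a2

# Example shapes: 1 sample, 128 video tokens, 64 audio tokens, 16 prior tokens.
block = SharedPriorBlock()
video, audio, prior = torch.randn(1, 128, 256), torch.randn(1, 64, 256), torch.randn(1, 16, 256)
v_out, a_out = block(video, audio, prior)
print(v_out.shape, a_out.shape)  # torch.Size([1, 128, 256]) torch.Size([1, 64, 256])
```

Conditioning both streams on one shared prior, rather than having them attend directly to each other, is one way to keep the two denoising paths aligned without coupling their token counts; how JavisDiT actually structures its hierarchical prior is detailed in the paper.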
Quick Start & Requirements
Installation requires a Python 3.10 environment (e.g., Conda): clone the repository, install dependencies from requirements/requirements-cu121.txt, and install FFmpeg (conda install -c conda-forge ffmpeg). The base inference install is pip install -v . run from the repository root. For better performance, the optional apex and flash-attn packages are recommended; both require CUDA 12.1. Pre-trained models are available on Hugging Face, and further details are on the project homepage and the arXiv paper.
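For convenience, here is a hedged shell sketch of that flow. The repository URL, environment name, and the flash-attn/apex commands are assumptions; defer to the project README for the exact steps.

```bash
# Hedged sketch of the install flow described above; repo URL, env name, and the
# optional-extras commands are assumptions -- follow the project README if they differ.
conda create -n javisdit python=3.10 -y
conda activate javisdit
git clone https://github.com/JavisDiT/JavisDiT.git    # assumed repository URL
cd JavisDiT
pip install -r requirements/requirements-cu121.txt    # CUDA 12.1 dependency set
conda install -c conda-forge ffmpeg -y                # FFmpeg for audio/video I/O
pip install -v .                                      # base inference install
# Optional, for performance (requires CUDA 12.1); exact commands may differ:
pip install flash-attn --no-build-isolation           # assumed standard flash-attn install
# apex is typically built from source per NVIDIA's instructions.
```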
Highlighted Details
Maintenance & Community
Recent updates (December 2025) include integration into the broader JavisVerse project and the release of JavisGPT, a multi-modal LLM. The project aims to advance joint audio-video intelligence. No dedicated community channels (Discord/Slack) are listed.
Licensing & Compatibility
The README does not explicitly state a software license, so usage rights, especially for commercial applications, need clarification from the maintainers.
Limitations & Caveats
The JavisDiT-v0.1 model is a preview release trained with limited resources. The developers are actively working to improve generation quality by refining the model architecture and training data.