JavisDiT by JavisVerse

Synchronized audio-video generation from text

Created 10 months ago
311 stars

Top 86.8% on SourcePulse

Project Summary

Summary JavisDiT tackles joint audio-video generation (JAVG) from text prompts with a Joint Audio-Video Diffusion Transformer built around Hierarchical Spatio-Temporal Prior Synchronization. It targets the central JAVG bottleneck of keeping generated audio and video coherent with each other and with the prompt, and ships state-of-the-art models alongside a new benchmark and evaluation metric to advance the JAVG research community.

How It Works The core is a Diffusion Transformer that synthesizes audio and video jointly. Its novelty is the Hierarchical Spatio-Temporal Prior Synchronization mechanism, which ensures temporal alignment and contextual relevance between the generated audio and video streams. The project complements the model with JavisBench, a large-scale benchmark dataset, and JavisScore, a robust synchronization metric, to standardize JAVG evaluation.

Quick Start & Requirements Installation requires a Python 3.10 environment (e.g., Conda), cloning the repo, and installing dependencies via requirements/requirements-cu121.txt. FFmpeg is also needed (conda install -c conda-forge ffmpeg). The primary inference install command is `pip install -v .`. For better performance, optionally installing apex and flash-attn is recommended; both require CUDA 12.1. Pre-trained models are available on Hugging Face. Further details are on the project homepage and the arXiv paper.
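The steps above can be sketched as a setup script. Note this is a minimal sketch assembled from the summary, not the repo's official instructions: the repository URL and environment name are assumptions to verify against the project README.

```shell
# Create an isolated Python 3.10 environment (Conda assumed available)
conda create -n javisdit python=3.10 -y
conda activate javisdit

# FFmpeg is required for audio/video processing
conda install -c conda-forge ffmpeg -y

# Clone the repository (URL is an assumption; check the "View on GitHub" link)
git clone https://github.com/JavisVerse/JavisDiT.git
cd JavisDiT

# Install pinned dependencies for CUDA 12.1, then the package itself
pip install -r requirements/requirements-cu121.txt
pip install -v .

# Optional performance extras (require CUDA 12.1; see the repo for
# apex and flash-attn build instructions, as they often need source builds)
```

Pre-trained weights (e.g., JavisDiT-v0.1) can then be fetched from Hugging Face as described in the repository.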

Highlighted Details

  • Introduces JavisBench: a benchmark dataset of 10,140 high-quality, text-captioned sounding videos.
  • Features JavisScore: a novel metric for robust audio-video synchronization evaluation.
  • Offers pre-trained models like JavisDiT-v0.1-prior (29M params) and JavisDiT-v0.1 (3.4B params), supporting resolutions from 144P to 1080P.
  • Provides detailed multi-stage training instructions for audio pretraining, prior estimation, and joint audio-video synchronization.

Maintenance & Community Recent updates (December 2025) include integration into the JavisVerse project and the release of JavisGPT, a multi-modal LLM. The project aims to advance Joint Audio-Video Intelligence. No specific community channels (Discord/Slack) are detailed.

Licensing & Compatibility The README does not explicitly state the software license. Clarification is needed regarding usage rights, especially for commercial applications.

Limitations & Caveats The JavisDiT-v0.1 model is a preview version trained with limited resources. Developers are actively working to enhance generation quality by refining model architecture and training data.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days
