JavisVerse: Synchronized audio-video generation from text
Summary
JavisDiT addresses synchronized joint audio-video generation (JAVG) from text prompts with a novel Joint Audio-Video Diffusion Transformer built around Hierarchical Spatio-Temporal Prior Synchronization. It targets the key bottleneck of keeping generated audio and video coherent with each other and with the prompt, and pairs state-of-the-art generation with a new benchmark and evaluation metric to advance JAVG research.
How It Works
The core is a Diffusion Transformer that jointly synthesizes audio and video. Its novelty lies in the Hierarchical Spatio-Temporal Prior Synchronization mechanism, which keeps the generated audio and video temporally aligned and contextually consistent. The project complements the model with JavisBench, a large-scale benchmark dataset, and JavisScore, a robust synchronization metric, to standardize JAVG evaluation.
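To make the mechanism concrete, below is a minimal, purely illustrative PyTorch sketch (not the actual JavisDiT code; the module layout, names, and dimensions are assumptions) of the general pattern the description implies: video and audio token streams each run their own attention, and both cross-attend to the same text-derived prior tokens, which is where cross-modal synchronization would be imposed.

```python
# Purely illustrative sketch of shared-prior conditioning (NOT the JavisDiT code).
import torch
import torch.nn as nn

class SharedPriorBlock(nn.Module):
    """One joint block: per-modality self-attention, then cross-attention
    from each modality to a shared (e.g., text-derived) prior sequence."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_prior = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_prior = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens, prior_tokens):
        # Per-modality self-attention (the usual DiT-style denoising path).
        v, _ = self.video_self(video_tokens, video_tokens, video_tokens)
        a, _ = self.audio_self(audio_tokens, audio_tokens, audio_tokens)
        v = self.norm(v + video_tokens)
        a = self.norm(a + audio_tokens)
        # Both streams attend to the SAME prior tokens; conditioning on a
        # shared structure is what nudges audio and video to stay in sync.
        v2, _ = self.video_prior(v, prior_tokens, prior_tokens)
        a2, _ = self.audio_prior(a, prior_tokens, prior_tokens)
        return v + v2, a + a2

# Example shapes: 1 sample, 128 video tokens, 64 audio tokens, 16 prior tokens.
block = SharedPriorBlock()
video, audio, prior = torch.randn(1, 128, 256), torch.randn(1, 64, 256), torch.randn(1, 16, 256)
v_out, a_out = block(video, audio, prior)
print(v_out.shape, a_out.shape)  # torch.Size([1, 128, 256]) torch.Size([1, 64, 256])
```

Conditioning both streams on one shared prior, rather than having them attend directly to each other, is one way to keep the two denoising paths aligned without coupling their token counts; how JavisDiT actually structures its hierarchical prior is detailed in the paper.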
Quick Start & Requirements
Installation requires a Python 3.10 environment (e.g., Conda): clone the repository, install dependencies from requirements/requirements-cu121.txt, and install FFmpeg (conda install -c conda-forge ffmpeg). The base inference install is pip install -v . run from the repository root. For better performance, the optional apex and flash-attn packages are recommended; both require CUDA 12.1. Pre-trained models are available on Hugging Face, and further details are on the project homepage and the arXiv paper.
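For convenience, here is a hedged shell sketch of that flow. The repository URL, environment name, and the flash-attn/apex commands are assumptions; defer to the project README for the exact steps.

```bash
# Hedged sketch of the install flow described above; repo URL, env name, and the
# optional-extras commands are assumptions -- follow the project README if they differ.
conda create -n javisdit python=3.10 -y
conda activate javisdit
git clone https://github.com/JavisDiT/JavisDiT.git    # assumed repository URL
cd JavisDiT
pip install -r requirements/requirements-cu121.txt    # CUDA 12.1 dependency set
conda install -c conda-forge ffmpeg -y                # FFmpeg for audio/video I/O
pip install -v .                                      # base inference install
# Optional, for performance (requires CUDA 12.1); exact commands may differ:
pip install flash-attn --no-build-isolation           # assumed standard flash-attn install
# apex is typically built from source per NVIDIA's instructions.
```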
Highlighted Details
Maintenance & Community
Recent updates (December 2025) include integration into the broader JavisVerse project and the release of JavisGPT, a multi-modal LLM. The project aims to advance joint audio-video intelligence. No dedicated community channels (Discord/Slack) are listed.
Licensing & Compatibility
The README does not explicitly state a software license, so usage rights, especially for commercial applications, need clarification from the maintainers.
Limitations & Caveats
The JavisDiT-v0.1 model is a preview release trained with limited resources. The developers are actively working to improve generation quality by refining the model architecture and training data.