character-ai/Ovi: Cross-modal fusion for synchronized audio-video generation
Top 34.1% on SourcePulse
Summary
Ovi is an open-source model for generating synchronized video and audio content from text or text+image inputs. It addresses the challenge of creating cohesive multimodal media with a unified approach that produces the visual and auditory streams simultaneously, which is useful for researchers and developers working on AI media generation.
How It Works
The model utilizes a "Twin Backbone Cross-Modal Fusion" architecture to process and generate video and audio concurrently, ensuring high temporal synchronization. It supports flexible conditioning on text alone or text+images, enabling diverse creative applications and fine-grained control.
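The twin-backbone idea can be pictured as two parallel transformer stacks, one per modality, that exchange information through cross-attention at each block so the streams stay temporally aligned. Below is a minimal, illustrative PyTorch sketch of that pattern; the class and argument names (FusionBlock, video_tokens, audio_tokens) are assumptions for exposition, not Ovi's actual implementation.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One twin-backbone block: each modality self-attends, then
    cross-attends to the other stream. Illustrative sketch only."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.audio_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_tokens):
        # Self-attention within each modality.
        v = video_tokens + self.video_self(video_tokens, video_tokens, video_tokens)[0]
        a = audio_tokens + self.audio_self(audio_tokens, audio_tokens, audio_tokens)[0]
        # Cross-modal fusion: each stream attends to the other.
        v = v + self.video_cross(v, a, a)[0]
        a = a + self.audio_cross(a, v, v)[0]
        # Per-modality feed-forward.
        return v + self.video_mlp(v), a + self.audio_mlp(a)

# Toy usage: 16 video latent tokens and 32 audio latent tokens, width 512.
block = FusionBlock(dim=512)
v, a = block(torch.randn(1, 16, 512), torch.randn(1, 32, 512))
```

Running both backbones through shared fusion blocks, rather than generating one modality and conditioning the other on it afterward, is what lets the model keep audio and video synchronized frame by frame.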
Quick Start & Requirements
Installation involves cloning the repo, setting up a Python virtual environment, and installing dependencies via requirements.txt. Prerequisites include PyTorch (v2.5.1) and Flash Attention. Model weights must be downloaded separately. A Gradio app is provided for interaction. Minimum GPU VRAM is 32GB, reducible to 24GB with fp8 quantization and CPU offload, though these may slightly degrade quality and increase runtime.
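A quick way to decide between the default settings and the reduced-memory path is to check available VRAM before launching generation. The thresholds below come from the requirements above; the decision logic is only a sketch, and the actual Ovi flags for fp8 quantization and CPU offload should be taken from the project's README.

```python
import torch

def pick_memory_mode() -> str:
    """Pick a memory configuration based on detected GPU VRAM (illustrative)."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA GPU is required.")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 32:
        return "full"             # default settings, full quality
    if total_gb >= 24:
        return "fp8_cpu_offload"  # fits in 24 GB; slightly lower quality, slower
    raise RuntimeError(f"Only {total_gb:.1f} GB VRAM detected; 24 GB is the minimum.")

print(pick_memory_mode())
```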
Highlighted Details
Prompts use special tags to mark speech (<S>, <E>) and audio descriptions (<AUDCAP>, <ENDAUDCAP>), giving fine-grained control over what is spoken and how the soundtrack should sound.
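As an illustration, spoken lines go between <S> and <E>, and a description of the soundscape between <AUDCAP> and <ENDAUDCAP>; the scene text below is a made-up example, not taken from the project.

```python
# Illustrative prompt combining scene text, a speech tag pair, and an audio caption.
prompt = (
    "A barista smiles at the camera and says "
    "<S>Your coffee is ready, enjoy!<E> "
    "<AUDCAP>Cheerful female voice, espresso machine hissing in the background.<ENDAUDCAP>"
)
```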