Ovi by character-ai

Cross-modal fusion for synchronized audio-video generation

Created 3 months ago
1,544 stars

Top 26.7% on SourcePulse

Project Summary

Summary

Ovi is an open-source model that generates synchronized video and audio from text or text-plus-image inputs. It addresses the challenge of creating cohesive multimodal media with a unified approach that produces the visual and auditory streams simultaneously, which is useful for researchers and developers working on AI media generation.

How It Works

The model uses a "Twin Backbone Cross-Modal Fusion" architecture to process and generate video and audio concurrently, keeping the two streams tightly synchronized in time. It supports conditioning on text alone or on text plus images, enabling diverse creative applications and fine-grained control.
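
To make the idea concrete, the sketch below shows a generic two-stream transformer block in PyTorch: video and audio tokens each self-attend within their own backbone, then cross-attend to the other stream so that timing information flows in both directions. This is an illustrative sketch only; the class name, dimensions, and token counts are hypothetical, and Ovi's actual fusion layers are defined in the repository.

```python
import torch
import torch.nn as nn

class TwinFusionBlock(nn.Module):
    """One layer of a hypothetical twin-backbone stack: each modality
    self-attends, then cross-attends to the other stream."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # Per-modality self-attention with residual connections.
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        # Bidirectional cross-attention ties the streams together temporally:
        # video tokens query audio tokens, and vice versa.
        v = self.norm_v(video + self.a2v_cross(video, audio, audio)[0])
        a = self.norm_a(audio + self.v2a_cross(audio, video, video)[0])
        return v, a

# Toy shapes: 120 video tokens (5 s at 24 FPS) alongside an audio token stream.
block = TwinFusionBlock()
v, a = block(torch.randn(1, 120, 512), torch.randn(1, 200, 512))
print(v.shape, a.shape)  # torch.Size([1, 120, 512]) torch.Size([1, 200, 512])
```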

Quick Start & Requirements

Installation involves cloning the repository, creating a Python virtual environment, and installing dependencies from requirements.txt. Prerequisites include PyTorch 2.5.1 and Flash Attention, and the model weights must be downloaded separately. A Gradio app is provided for interactive use. The minimum GPU VRAM is 32 GB, reducible to 24 GB with fp8 quantization and CPU offload, though both options may slightly degrade quality and increase runtime; a minimal capability check is sketched below.
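
To decide between the full-precision and reduced-memory paths, one could probe available VRAM before loading the model. The snippet below is a minimal sketch assuming a CUDA GPU; the use_fp8 and cpu_offload flags are illustrative stand-ins for the options described above, not Ovi's actual API.

```python
# Hypothetical pre-flight check: pick a memory configuration from available VRAM.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")

if vram_gb >= 32:
    use_fp8, cpu_offload = False, False  # full-quality path
elif vram_gb >= 24:
    use_fp8, cpu_offload = True, True    # fits, at some cost in quality and runtime
else:
    raise SystemExit("Ovi needs at least 24 GB of VRAM, even with fp8 and CPU offload.")
```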

Highlighted Details

  • Generates 5-second videos at 24 FPS and 720x720 resolution, with support for various aspect ratios.
  • Flexible input: text-only, text+image, and an 'i2v' mode that uses an image generation model to produce the initial frame.
  • Advanced prompt formatting uses special tags for speech (<S>, <E>) and audio descriptions (<AUDCAP>, <ENDAUDCAP>); a formatting helper is sketched after this list.
  • Example prompts and GPT-assisted prompt creation are provided in the repository.
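
The tag strings above come from the project's documented prompt format; the small helper below that assembles a tagged prompt is a hypothetical convenience for illustration, not part of Ovi's codebase.

```python
# Hypothetical helper: wrap speech and audio descriptions in Ovi's prompt tags.
def format_ovi_prompt(scene: str, speech: str | None = None,
                      audio_caption: str | None = None) -> str:
    parts = [scene]
    if speech:
        parts.append(f"<S>{speech}<E>")  # spoken line to be voiced
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption}<ENDAUDCAP>")  # sound design
    return " ".join(parts)

prompt = format_ovi_prompt(
    "A street performer plays guitar at dusk.",
    speech="Thank you all for coming out tonight!",
    audio_caption="Acoustic guitar strumming, light crowd chatter, city ambience.",
)
print(prompt)
```
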
Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
5
Star History
118 stars in the last 30 days
