Ovi by character-ai

Cross-modal fusion for synchronized audio-video generation

Created 3 months ago
1,544 stars

Top 26.7% on SourcePulse

Project Summary

Summary

Ovi is an open-source model that generates synchronized video and audio from text or text-plus-image inputs. It addresses the challenge of creating cohesive multimodal media with a unified approach that produces the visual and auditory streams simultaneously, which is useful for researchers and developers working on AI media generation.

How It Works

The model uses a "Twin Backbone Cross-Modal Fusion" architecture to process and generate video and audio concurrently, keeping the two streams tightly synchronized in time. It supports conditioning on text alone or on text plus images, enabling diverse creative applications and fine-grained control.
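
To make the idea concrete, the sketch below shows a generic two-stream transformer block in PyTorch: video and audio tokens each self-attend within their own backbone, then cross-attend to the other stream so that timing information flows in both directions. This is an illustrative sketch only; the class name, dimensions, and token counts are hypothetical, and Ovi's actual fusion layers are defined in the repository.

```python
import torch
import torch.nn as nn

class TwinFusionBlock(nn.Module):
    """One layer of a hypothetical twin-backbone stack: each modality
    self-attends, then cross-attends to the other stream."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # Per-modality self-attention with residual connections.
        video = video + self.video_self(video, video, video)[0]
        audio = audio + self.audio_self(audio, audio, audio)[0]
        # Bidirectional cross-attention ties the streams together temporally:
        # video tokens query audio tokens, and vice versa.
        v = self.norm_v(video + self.a2v_cross(video, audio, audio)[0])
        a = self.norm_a(audio + self.v2a_cross(audio, video, video)[0])
        return v, a

# Toy shapes: 120 video tokens (5 s at 24 FPS) alongside an audio token stream.
block = TwinFusionBlock()
v, a = block(torch.randn(1, 120, 512), torch.randn(1, 200, 512))
print(v.shape, a.shape)  # torch.Size([1, 120, 512]) torch.Size([1, 200, 512])
```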

Quick Start & Requirements

Installation involves cloning the repository, creating a Python virtual environment, and installing dependencies from requirements.txt. Prerequisites include PyTorch 2.5.1 and Flash Attention, and the model weights must be downloaded separately. A Gradio app is provided for interactive use. The minimum GPU VRAM is 32 GB, reducible to 24 GB with fp8 quantization and CPU offload, though both options may slightly degrade quality and increase runtime; a minimal capability check is sketched below.
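
To decide between the full-precision and reduced-memory paths, one could probe available VRAM before loading the model. The snippet below is a minimal sketch assuming a CUDA GPU; the use_fp8 and cpu_offload flags are illustrative stand-ins for the options described above, not Ovi's actual API.

```python
# Hypothetical pre-flight check: pick a memory configuration from available VRAM.
import torch

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")

if vram_gb >= 32:
    use_fp8, cpu_offload = False, False  # full-quality path
elif vram_gb >= 24:
    use_fp8, cpu_offload = True, True    # fits, at some cost in quality and runtime
else:
    raise SystemExit("Ovi needs at least 24 GB of VRAM, even with fp8 and CPU offload.")
```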

Highlighted Details

  • Generates 5-second videos at 24 FPS and 720x720 resolution, with support for various aspect ratios.
  • Flexible input: text-only, text+image, and an 'i2v' mode that uses an image generation model to produce the initial frame.
  • Advanced prompt formatting uses special tags for speech (<S>, <E>) and audio descriptions (<AUDCAP>, <ENDAUDCAP>); a formatting helper is sketched after this list.
  • Example prompts and GPT-assisted prompt creation are provided in the repository.
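
The tag strings above come from the project's documented prompt format; the small helper below that assembles a tagged prompt is a hypothetical convenience for illustration, not part of Ovi's codebase.

```python
# Hypothetical helper: wrap speech and audio descriptions in Ovi's prompt tags.
def format_ovi_prompt(scene: str, speech: str | None = None,
                      audio_caption: str | None = None) -> str:
    parts = [scene]
    if speech:
        parts.append(f"<S>{speech}<E>")  # spoken line to be voiced
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption}<ENDAUDCAP>")  # sound design
    return " ".join(parts)

prompt = format_ovi_prompt(
    "A street performer plays guitar at dusk.",
    speech="Thank you all for coming out tonight!",
    audio_caption="Acoustic guitar strumming, light crowd chatter, city ambience.",
)
print(prompt)
```
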
Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
5
Star History
118 stars in the last 30 days
