OmniCustom by OmniCustom-project

Multimodal AI for synchronized audio-video generation

Created 2 months ago
420 stars

Top 70.0% on SourcePulse

View on GitHub
Project Summary

OmniCustom is an open-source framework for synchronized audio-video customization, enabling users to generate videos that precisely match a reference image's visual identity and a reference audio's timbre, while allowing the speech content to be freely specified via text prompts. It targets researchers and developers in generative AI for media synthesis, offering a novel approach to controllable and synchronized AV content creation.

How It Works

OmniCustom employs a joint audio-video generation model. It takes a reference image and reference audio as input, preserving their respective visual and auditory characteristics, while a textual prompt dictates the speech content to be synthesized. The framework integrates several pre-trained models: OVI for base generation, NaturalSpeech 3 for timbre embeddings, InsightFace for face embeddings, and LivePortrait for reference-image cropping. Together these produce synchronized, customized AV output.

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (python=3.10), install requirements (pip install -r requirements.txt), and install Flash Attention (pip install flash-attn --no-build-isolation).
  • Model Download: Utilize download_weights.py and huggingface-cli download to obtain the necessary checkpoints (OmniCustom, NaturalSpeech 3, InsightFace, LivePortrait, MMAudio, Wan2.2) and place them in the ckpts/ directory.
  • Configuration: Modify parameters in OmniCustom/configs/inference/inference_fusion.yaml to control generation quality, resolution, and input balancing.
  • Inference: Execute bash ./inference.sh or run infer.py with specified configurations.
  • Prerequisites: Python 3.10 and Conda.
  • Hardware: Requires 80 GB of VRAM on a single GPU for inference.
  • Links: Project page: https://OmniCustom-project.github.io/page/
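The steps above can be sketched as a single shell session. This is a minimal sketch based on the summary, not the project's official instructions: the repository URL, environment name, and download_weights.py invocation are assumptions, so check the repository's README for the exact commands.

```shell
# Hypothetical setup session for OmniCustom; paths and URL are assumptions.

# 1. Clone the repository and create the Conda environment (Python 3.10).
git clone https://github.com/OmniCustom-project/OmniCustom.git
cd OmniCustom
conda create -n omnicustom python=3.10 -y
conda activate omnicustom

# 2. Install dependencies, then Flash Attention separately
#    (built against the already-installed torch, hence no build isolation).
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# 3. Fetch checkpoints into ckpts/ (OmniCustom, NaturalSpeech 3,
#    InsightFace, LivePortrait, MMAudio, Wan2.2).
python download_weights.py

# 4. Adjust quality, resolution, and input balancing in
#    OmniCustom/configs/inference/inference_fusion.yaml, then run:
bash ./inference.sh
```

Note that step 4 requires a GPU with 80 GB of VRAM, per the hardware requirement above.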

Highlighted Details

  • Synchronous generation of video and audio with customizable content.
  • Preserves visual identity from a reference image and audio timbre from a reference audio.
  • Speech content is controllable via textual prompts.
  • Integrates multiple specialized models for comprehensive AV synthesis.

Maintenance & Community

The provided README does not contain information regarding notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The license for the OmniCustom project is not specified in the README. This omission prevents an assessment of its compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

Currently, only inference codes and model checkpoints are publicly available; training codes and an evaluation benchmark are listed as future open-source targets. The substantial 80 GB VRAM requirement presents a significant barrier to entry for users without high-end hardware. The absence of a specified license is a critical adoption blocker.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
64 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Top 0.2% on SourcePulse
3k stars
Audio generation research paper using latent diffusion
Created 3 years ago
Updated 9 months ago