OmniCustom by OmniCustom-project

Multimodal AI for synchronized audio-video generation

Created 2 months ago
420 stars

Top 70.0% on SourcePulse

View on GitHub
Project Summary

OmniCustom is an open-source framework for synchronized audio-video customization, enabling users to generate videos that precisely match a reference image's visual identity and a reference audio's timbre, while allowing the speech content to be freely specified via text prompts. It targets researchers and developers in generative AI for media synthesis, offering a novel approach to controllable and synchronized AV content creation.

How It Works

OmniCustom employs a joint audio-video generation model. It takes a reference image and reference audio as input, preserving their respective visual and auditory characteristics, while a textual prompt dictates the speech content to be synthesized. The framework integrates several pre-trained models: OVI for base generation, NaturalSpeech 3 for timbre embeddings, InsightFace for face embeddings, and LivePortrait for reference-image cropping. Together these produce synchronized, customized AV output.

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (python=3.10), install requirements (pip install -r requirements.txt), and install Flash Attention (pip install flash-attn --no-build-isolation).
  • Model Download: Utilize download_weights.py and huggingface-cli download to obtain the necessary checkpoints (OmniCustom, NaturalSpeech 3, InsightFace, LivePortrait, MMAudio, Wan2.2) and place them in the ckpts/ directory.
  • Configuration: Modify parameters in OmniCustom/configs/inference/inference_fusion.yaml to control generation quality, resolution, and input balancing.
  • Inference: Execute bash ./inference.sh or run infer.py with specified configurations.
  • Prerequisites: Python 3.10 and Conda.
  • Hardware: Requires 80 GB of VRAM on a single GPU for inference.
  • Links: Project page: https://OmniCustom-project.github.io/page/
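The steps above can be sketched as a single shell session. This is a minimal sketch based on the summary, not the project's official instructions: the repository URL, environment name, and download_weights.py invocation are assumptions, so check the repository's README for the exact commands.

```shell
# Hypothetical setup session for OmniCustom; paths and URL are assumptions.

# 1. Clone the repository and create the Conda environment (Python 3.10).
git clone https://github.com/OmniCustom-project/OmniCustom.git
cd OmniCustom
conda create -n omnicustom python=3.10 -y
conda activate omnicustom

# 2. Install dependencies, then Flash Attention separately
#    (built against the already-installed torch, hence no build isolation).
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

# 3. Fetch checkpoints into ckpts/ (OmniCustom, NaturalSpeech 3,
#    InsightFace, LivePortrait, MMAudio, Wan2.2).
python download_weights.py

# 4. Adjust quality, resolution, and input balancing in
#    OmniCustom/configs/inference/inference_fusion.yaml, then run:
bash ./inference.sh
```

Note that step 4 requires a GPU with 80 GB of VRAM, per the hardware requirement above.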

Highlighted Details

  • Synchronous generation of video and audio with customizable content.
  • Preserves visual identity from a reference image and audio timbre from a reference audio.
  • Speech content is controllable via textual prompts.
  • Integrates multiple specialized models for comprehensive AV synthesis.

Maintenance & Community

The provided README does not contain information regarding notable contributors, sponsorships, community channels (e.g., Discord, Slack), or a public roadmap.

Licensing & Compatibility

The license for the OmniCustom project is not specified in the README. This omission prevents an assessment of its compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

Currently, only inference codes and model checkpoints are publicly available; training codes and an evaluation benchmark are listed as future open-source targets. The substantial 80 GB VRAM requirement presents a significant barrier to entry for users without high-end hardware. The absence of a specified license is a critical adoption blocker.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
64 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral) and Omar Sanseviero (DevRel at Google DeepMind).

AudioLDM by haoheliu

Top 0.2% on SourcePulse
3k stars
Audio generation research paper using latent diffusion
Created 3 years ago
Updated 9 months ago