MultiTalk by MeiGen-AI

Audio-driven multi-person conversational video generation

Created 3 months ago
2,466 stars

Top 18.9% on SourcePulse

View on GitHub
Project Summary

MultiTalk is an open-source framework for generating audio-driven multi-person conversational videos. From audio input and optional text prompts, it creates videos of multiple characters interacting, singing, or appearing as cartoon characters. Its primary benefit is realistic, synchronized multi-character video generation driven directly by audio.

How It Works

MultiTalk takes multi-stream audio (one track per speaker), a reference image, and a text prompt, and generates video whose lip motion stays synchronized with each audio stream. Prompts provide direct control over the virtual humans' behavior. The architecture supports realistic conversations and interactive character control, generalizes to cartoon characters and singing, and offers flexible output resolution.
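As a concrete illustration, the per-speaker inputs could be bundled into a single JSON spec along these lines (a minimal sketch; the field names are assumptions modeled on the example files shipped with the repository, so check those for the exact schema):

```bash
# Hypothetical input spec for a two-person clip. Field names (prompt,
# cond_image, cond_audio) are assumptions; consult the repository's
# example JSONs for the exact schema.
cat > two_person.json <<'EOF'
{
  "prompt": "Two people sit at a cafe table, chatting warmly.",
  "cond_image": "examples/ref_two_people.png",
  "cond_audio": {
    "person1": "examples/speaker1.wav",
    "person2": "examples/speaker2.wav"
  }
}
EOF
```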

Quick Start & Requirements

  • Installation: Create a conda environment, install PyTorch (cu121), xformers, and flash-attn, then install the remaining dependencies via pip install -r requirements.txt. FFmpeg is also required. A hedged setup sketch follows this list.
  • Model Preparation: Download several models from Hugging Face (e.g., Wan2.1-I2V-14B-480P, chinese-wav2vec2-base, MeiGen-MultiTalk) and link or copy the MeiGen-MultiTalk weights into the base model directory.
  • Hardware: Supports single-GPU inference (including low-VRAM configurations via num_persistent_param_in_dit 0), multi-GPU inference, and optimizations such as TeaCache (2-3x speedup) and INT8 quantization.
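A minimal setup sketch, assuming a CUDA 12.1 machine; the Hugging Face repository paths and weight file names below are assumptions derived from the model names above and should be verified against the README:

```bash
# Environment setup (package pins omitted; see requirements.txt for specifics).
conda create -n multitalk python=3.10 -y
conda activate multitalk
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install xformers flash-attn
pip install -r requirements.txt
conda install -c conda-forge ffmpeg -y

# Model preparation: download the base video model, the audio encoder, and
# the MultiTalk weights (Hugging Face org/repo names are assumptions).
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk

# Copy (or symlink) the MultiTalk weights into the base model directory, as
# the Model Preparation step requires; the exact file name is an assumption.
cp ./weights/MeiGen-MultiTalk/multitalk.safetensors ./weights/Wan2.1-I2V-14B-480P/
```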

Highlighted Details

  • Supports multi-person conversational video generation.
  • Enables interactive character control via prompts.
  • Offers generalization to cartoon characters and singing.
  • Provides resolution flexibility (480p, 720p) and supports up to 15-second video generation.
  • Includes optimizations such as TeaCache, INT8 quantization, and LoRA acceleration (FusionX, lightx2v) for faster inference; a hedged invocation sketch follows this list.
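As an illustration of how these options combine at inference time, a single-GPU invocation might look like the following. This is a sketch: the script name and flags are assumptions based on the project's documentation, and only num_persistent_param_in_dit is cited above.

```bash
# Hypothetical single-GPU run combining the optimizations listed above.
# --use_teacache enables the ~2-3x TeaCache speedup; setting
# --num_persistent_param_in_dit 0 is the low-VRAM configuration noted in
# Quick Start. Script and flag names are assumptions; check the README.
python generate_multitalk.py \
  --ckpt_dir ./weights/Wan2.1-I2V-14B-480P \
  --wav2vec_dir ./weights/chinese-wav2vec2-base \
  --input_json ./two_person.json \
  --sample_steps 40 \
  --mode clip \
  --use_teacache \
  --num_persistent_param_in_dit 0 \
  --save_file two_person_demo
```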

Maintenance & Community

The project has seen recent updates (July 2025), including INT8 quantization, SageAttention2.2, and FusionX LoRA support. Community integrations include Replicate, Gradio demos, and ComfyUI, and a Google Colab example is also available.

Licensing & Compatibility

The models are licensed under the Apache 2.0 License. The license permits free use of generated content, provided that use complies with its terms and applicable laws; harmful, illegal, or misleading content is prohibited.

Limitations & Caveats

While 720p inference is mentioned, the current code primarily supports 480p, with 720p requiring multiple GPUs. Longer video generation (beyond 81 frames) may reduce prompt-following performance. The project is actively being developed, with items like LCM distillation and a 1.3B model still on the todo list.
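For clips beyond the 81-frame window, generation is typically chunked; a hedged sketch, assuming the generation script exposes a streaming mode for long clips:

```bash
# Hypothetical long-video run. The streaming mode name is an assumption;
# note the caveat above that prompt-following may degrade past 81 frames.
python generate_multitalk.py \
  --ckpt_dir ./weights/Wan2.1-I2V-14B-480P \
  --wav2vec_dir ./weights/chinese-wav2vec2-base \
  --input_json ./two_person.json \
  --mode streaming \
  --save_file two_person_long
```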

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 18
  • Star History: 274 stars in the last 30 days
