HuMo: Human-Centric Video Generation Framework
HuMo is a unified, human-centric framework for generating high-quality, fine-grained, and controllable human videos from multimodal inputs: text, images, and audio. It targets researchers and developers who need precise control over human video synthesis, offering strong text-prompt following, consistent subject preservation, and audio-synchronized motion.
How It Works
The framework uses collaborative multimodal conditioning, combining text, image, and audio inputs to guide video generation. Text prompts paired with reference images give precise control over character appearance, clothing, makeup, props, and scenes. The framework can also generate audio-synchronized human motion from text and audio alone, removing the need for image references and allowing greater creative freedom.
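To make the two conditioning modes concrete, here is a hypothetical invocation sketch; the script name and flags are illustrative assumptions, not HuMo's documented interface (see the repository README for the real entry points):

    # Text + Image + Audio (TIA): the reference image pins the subject's
    # appearance while the audio track drives synchronized motion.
    # (Hypothetical script name and flags, for illustration only.)
    python generate.py --mode TIA --prompt "a singer performing on stage" \
        --ref_image singer.png --audio vocals.wav

    # Text + Audio (TA): no reference image; appearance comes from the
    # text prompt alone.
    python generate.py --mode TA --prompt "a singer performing on stage" \
        --audio vocals.wav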
Quick Start & Requirements
Setup involves creating a conda environment (conda create -n humo python=3.11), activating it, and installing pinned versions of PyTorch (torch==2.5.1, torchvision==0.20.1, torchaudio==2.5.1, built against CUDA 12.4), flash_attn==2.6.3, and the remaining dependencies from requirements.txt. ffmpeg is also required (conda install -c conda-forge ffmpeg). Model weights are downloaded using the huggingface-cli commands provided in the README. The key external dependencies are therefore CUDA-enabled PyTorch, flash_attn, and ffmpeg.
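Assembled into a runnable sequence, the steps above look roughly like this; the CUDA 12.4 wheel index URL and the Hugging Face repo id are assumptions, so defer to the README's exact commands:

    # Create and activate the environment (Python 3.11).
    conda create -n humo python=3.11
    conda activate humo

    # Pinned PyTorch builds; the cu124 wheel index URL is an assumption.
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
        --index-url https://download.pytorch.org/whl/cu124

    # Remaining dependencies.
    pip install flash_attn==2.6.3
    pip install -r requirements.txt
    conda install -c conda-forge ffmpeg

    # Fetch model weights (repo id assumed; use the README's huggingface-cli commands).
    huggingface-cli download bytedance-research/HuMo --local-dir ./weights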
Highlighted Details
Generation runs are configured through generate.yaml.
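For context, a config-driven run would look something like the line below; the entry-point name is a hypothetical placeholder, since this summary only names the config file:

    # Hypothetical invocation; HuMo's real entry point may differ.
    python generate.py --config generate.yaml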
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats