HuMo by Phantom-video

Human-centric video generation framework

Created 1 week ago · 511 stars · Top 61.2% on SourcePulse

Project Summary

HuMo is a unified, human-centric video generation framework designed for producing high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It targets researchers and developers seeking advanced control over human video synthesis, offering benefits such as strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

How It Works

The framework uses a collaborative multimodal conditioning approach, integrating text, image, and audio inputs to guide video generation. Text prompts combined with reference images give precise control over character appearance, clothing, makeup, props, and scenes, while audio-synchronized human motion can be generated from text and audio alone, removing the need for image references and offering greater creative freedom.
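
Conceptually, the image and audio signals are optional inputs layered on a required text prompt, and the combination of supplied inputs determines the generation mode. The sketch below is purely illustrative: the class and field names are hypothetical and do not reflect HuMo's actual interface; it only shows how the presence or absence of image and audio inputs selects a mode.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GenerationRequest:
        """Hypothetical container for HuMo-style multimodal inputs (not the project's real API)."""
        text_prompt: str                       # always required: appearance, clothing, props, scene
        reference_image: Optional[str] = None  # optional path to a subject reference image
        audio_track: Optional[str] = None      # optional path to a driving audio file

        def mode(self) -> str:
            """Map the supplied inputs to one of the documented generation modes."""
            if self.reference_image and self.audio_track:
                return "text-image-audio"  # full control: subject identity plus synced motion
            if self.audio_track:
                return "text-audio"        # audio-driven motion without an image reference
            if self.reference_image:
                return "text-image"        # visual customization from text plus a reference image
            return "text-only"

    # Example: audio-driven generation with no image reference.
    request = GenerationRequest(text_prompt="a dancer in a red coat on a rooftop",
                                audio_track="speech.wav")
    print(request.mode())  # -> "text-audio"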

Quick Start & Requirements

  • Installation: Requires creating a conda environment (conda create -n humo python=3.11), activating it, and installing specific versions of PyTorch (torch==2.5.1, torchvision==0.20.1, torchaudio==2.5.1 with CUDA 12.4 support), flash_attn==2.6.3, and other dependencies via requirements.txt. ffmpeg is also required (conda install -c conda-forge ffmpeg). A consolidated command sketch follows this list.
  • Model Preparation: Essential models include HuMo-17B, Wan-2.1 (VAE & text encoder), and Whisper-large-v3 (audio encoder). An optional audio separator is also available. All models are hosted on Hugging Face and can be downloaded using huggingface-cli commands provided in the README.
  • Prerequisites: Python 3.11, PyTorch 2.5.1 with CUDA 12.4, flash_attn, and ffmpeg.
  • Resource Footprint: Not explicitly detailed, but the 17B model suggests significant GPU memory requirements for inference. Multi-GPU inference is supported.
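
Putting the installation and model-preparation steps together, a typical setup session might look like the following. The PyTorch index URL and the Hugging Face repository IDs for HuMo-17B and Wan-2.1 are assumptions (placeholders are marked); the exact download commands are given in the project README.

    # Environment with the versions pinned above
    conda create -n humo python=3.11 -y
    conda activate humo
    # CUDA 12.4 wheels; the index URL is an assumption, check the README
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
    pip install flash_attn==2.6.3
    pip install -r requirements.txt
    conda install -c conda-forge ffmpeg -y

    # Model weights via huggingface-cli; replace the placeholder repo IDs with those in the README
    huggingface-cli download <HuMo-17B-repo-id> --local-dir ./weights/HuMo-17B
    huggingface-cli download <Wan2.1-repo-id> --local-dir ./weights/Wan2.1                    # VAE & text encoder
    huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3   # audio encoder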

Highlighted Details

  • Multi-Modal Generation: Offers distinct modes: Text-Image for visual customization, Text-Audio for audio-driven motion without image references, and Text-Image-Audio for comprehensive control.
  • Human-Centric Control: Features strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.
  • Resolution & Quality: Supports 480P and 720P resolutions, with 720P yielding superior quality.
  • Configurability: Generation parameters like frame count, guidance scales (text, audio), and video resolution are adjustable via generate.yaml; an illustrative sketch follows this list.
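
To illustrate the configurability point above, a generate.yaml along the following lines would expose the parameters mentioned. The key names and values are guesses for illustration, not the file's actual schema, so consult the shipped config before editing.

    # Hypothetical sketch of generate.yaml; key names and values are assumptions, not the real schema
    generation:
      frames: 97        # recommended upper bound; quality may degrade beyond this
      height: 720       # 480 or 720; 720P gives better quality at higher cost
      width: 1280
      scale_t: 7.5      # text guidance scale (illustrative value)
      scale_a: 5.5      # audio guidance scale (illustrative value)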

Maintenance & Community

  • Development: A project by Tsinghua University and ByteDance.
  • Contact: Primary contacts are Liyang Chen and Tianxiang Ma.
  • Support: Users can open issues for questions or comments. The project paper is available on arXiv.

Licensing & Compatibility

  • The license type is not specified in the provided README.

Limitations & Caveats

  • Generation Length: Optimal performance is limited to 97 frames; quality may degrade for longer videos. A new checkpoint for extended generation is planned.
  • Model Availability: The HuMo-1.7B model checkpoint is slated for future release.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 19
  • Star History: 521 stars in the last 9 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), and 1 more.

Lumina-T2X by Alpha-VLLM

Framework for text-to-any-modality generation
2k stars · Created 1 year ago · Updated 7 months ago