HuMo by Phantom-video

Human-centric video generation framework

Created 1 week ago · 511 stars · Top 61.2% on SourcePulse

Project Summary

HuMo is a unified, human-centric video generation framework designed for producing high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It targets researchers and developers seeking advanced control over human video synthesis, offering benefits such as strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

How It Works

The framework uses a collaborative multimodal conditioning approach, integrating text, image, and audio inputs to guide video generation. Text prompts combined with reference images give precise control over character appearance, clothing, makeup, props, and scenes, while audio-synchronized human motion can be generated from text and audio alone, removing the need for image references and offering greater creative freedom.
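
Conceptually, the image and audio signals are optional inputs layered on a required text prompt, and the combination of supplied inputs determines the generation mode. The sketch below is purely illustrative: the class and field names are hypothetical and do not reflect HuMo's actual interface; it only shows how the presence or absence of image and audio inputs selects a mode.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GenerationRequest:
        """Hypothetical container for HuMo-style multimodal inputs (not the project's real API)."""
        text_prompt: str                       # always required: appearance, clothing, props, scene
        reference_image: Optional[str] = None  # optional path to a subject reference image
        audio_track: Optional[str] = None      # optional path to a driving audio file

        def mode(self) -> str:
            """Map the supplied inputs to one of the documented generation modes."""
            if self.reference_image and self.audio_track:
                return "text-image-audio"  # full control: subject identity plus synced motion
            if self.audio_track:
                return "text-audio"        # audio-driven motion without an image reference
            if self.reference_image:
                return "text-image"        # visual customization from text plus a reference image
            return "text-only"

    # Example: audio-driven generation with no image reference.
    request = GenerationRequest(text_prompt="a dancer in a red coat on a rooftop",
                                audio_track="speech.wav")
    print(request.mode())  # -> "text-audio"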

Quick Start & Requirements

  • Installation: Requires creating a conda environment (conda create -n humo python=3.11), activating it, and installing specific versions of PyTorch (torch==2.5.1, torchvision==0.20.1, torchaudio==2.5.1 with CUDA 12.4 support), flash_attn==2.6.3, and other dependencies via requirements.txt. ffmpeg is also required (conda install -c conda-forge ffmpeg). A consolidated command sketch follows this list.
  • Model Preparation: Essential models include HuMo-17B, Wan-2.1 (VAE & text encoder), and Whisper-large-v3 (audio encoder). An optional audio separator is also available. All models are hosted on Hugging Face and can be downloaded using huggingface-cli commands provided in the README.
  • Prerequisites: Python 3.11, PyTorch 2.5.1 with CUDA 12.4, flash_attn, and ffmpeg.
  • Resource Footprint: Not explicitly detailed, but the 17B model suggests significant GPU memory requirements for inference. Multi-GPU inference is supported.
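
Putting the installation and model-preparation steps together, a typical setup session might look like the following. The PyTorch index URL and the Hugging Face repository IDs for HuMo-17B and Wan-2.1 are assumptions (placeholders are marked); the exact download commands are given in the project README.

    # Environment with the versions pinned above
    conda create -n humo python=3.11 -y
    conda activate humo
    # CUDA 12.4 wheels; the index URL is an assumption, check the README
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
    pip install flash_attn==2.6.3
    pip install -r requirements.txt
    conda install -c conda-forge ffmpeg -y

    # Model weights via huggingface-cli; replace the placeholder repo IDs with those in the README
    huggingface-cli download <HuMo-17B-repo-id> --local-dir ./weights/HuMo-17B
    huggingface-cli download <Wan2.1-repo-id> --local-dir ./weights/Wan2.1                    # VAE & text encoder
    huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3   # audio encoder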

Highlighted Details

  • Multi-Modal Generation: Offers distinct modes: Text-Image for visual customization, Text-Audio for audio-driven motion without image references, and Text-Image-Audio for comprehensive control.
  • Human-Centric Control: Features strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.
  • Resolution & Quality: Supports 480P and 720P resolutions, with 720P yielding superior quality.
  • Configurability: Generation parameters like frame count, guidance scales (text, audio), and video resolution are adjustable via generate.yaml; an illustrative sketch follows this list.
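
To illustrate the configurability point above, a generate.yaml along the following lines would expose the parameters mentioned. The key names and values are guesses for illustration, not the file's actual schema, so consult the shipped config before editing.

    # Hypothetical sketch of generate.yaml; key names and values are assumptions, not the real schema
    generation:
      frames: 97        # recommended upper bound; quality may degrade beyond this
      height: 720       # 480 or 720; 720P gives better quality at higher cost
      width: 1280
      scale_t: 7.5      # text guidance scale (illustrative value)
      scale_a: 5.5      # audio guidance scale (illustrative value)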

Maintenance & Community

  • Development: A project by Tsinghua University and ByteDance.
  • Contact: Primary contacts are Liyang Chen and Tianxiang Ma.
  • Support: Users can open issues for questions or comments. The project paper is available on arXiv.

Licensing & Compatibility

  • The license type is not specified in the provided README.

Limitations & Caveats

  • Generation Length: Optimal performance is limited to 97 frames; quality may degrade for longer videos. A new checkpoint for extended generation is planned.
  • Model Availability: The HuMo-1.7B model checkpoint is slated for future release.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 19
  • Star History: 521 stars in the last 9 days

Explore Similar Projects

Starred by Jiaming Song (Chief Scientist at Luma AI), Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), and 1 more.

Lumina-T2X by Alpha-VLLM

Framework for text-to-any-modality generation
2k stars · Created 1 year ago · Updated 7 months ago