hallo by fudan-generative-vision

Audio-driven visual synthesis for portrait image animation

created 1 year ago
8,539 stars

Top 6.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Hallo addresses the challenge of animating portrait images based on audio input, enabling realistic lip-sync and head movements. It is designed for researchers and developers interested in generative AI, computer vision, and audio-visual synthesis, offering a powerful tool for creating dynamic visual content from static images and speech.

How It Works

Hallo employs a hierarchical approach to audio-driven visual synthesis. It leverages pre-trained models for face analysis, audio separation, and motion generation, and integrates Stable Diffusion for visual synthesis. The system processes a source image and driving audio, extracting key features such as facial landmarks and motion cues to animate the portrait. This layered structure allows fine-grained control over different aspects of the animation, yielding more natural and expressive results.

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n hallo python=3.10), activate it (conda activate hallo), install the requirements (pip install -r requirements.txt), then install the package (pip install .). FFmpeg is also required (apt-get install ffmpeg). The full sequence is sketched after this list.
  • Prerequisites: Ubuntu 20.04/22.04, CUDA 12.1; tested GPUs include the A100.
  • Pretrained Models: Download required models from HuggingFace (git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models).
  • Usage: Run inference via python scripts/inference.py --source_image <image_path> --driving_audio <audio_path>.
  • Resources: showcase video (https://github.com/fudan-generative-vision/hallo/assets/17402682/9d1a0de4-3470-4d38-9e4f-412f517f834c), Hugging Face demo (https://huggingface.co/spaces/fudan-generative-vision/hallo).
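Taken together, the quick-start steps above amount to the following shell session. This is a sketch: the clone URL is inferred from the project's GitHub path, and the example image and audio paths are placeholders.

    # Clone the repository and enter it.
    git clone https://github.com/fudan-generative-vision/hallo
    cd hallo

    # Create and activate the conda environment (Python 3.10).
    conda create -n hallo python=3.10
    conda activate hallo

    # Install dependencies, then the package itself.
    pip install -r requirements.txt
    pip install .

    # FFmpeg is required for audio/video handling.
    apt-get install ffmpeg

    # Fetch the pretrained models from Hugging Face.
    git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models

    # Run inference on a source portrait and a driving audio clip
    # (placeholder paths -- substitute your own files).
    python scripts/inference.py --source_image examples/source.jpg --driving_audio examples/audio.wav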

Highlighted Details

  • Supports training custom data with released training code.
  • Community contributions include Windows version, ComfyUI, WebUI, and Docker templates.
  • Input requirements: square-cropped source images with the face occupying 50-70% of the frame and rotated less than 30°; driving audio in WAV format, in English, with clear vocals. A preprocessing sketch follows this list.
  • Offers fine-grained control over pose, face, and lip animation weights.
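The input requirements and weight controls above can be exercised from the shell. This is a hedged sketch: the ffmpeg commands are ordinary conversions toward the stated constraints, and the weight flag names (--pose_weight, --face_weight, --lip_weight) are assumed from the project README, so verify them against scripts/inference.py.

    # Convert driving audio to mono WAV (16 kHz is a common choice;
    # check the repository docs for the expected sample rate).
    ffmpeg -i speech.mp3 -ac 1 -ar 16000 driving_audio.wav

    # Center-crop the source image to a square; the face framing
    # (50-70% of frame, <30 degree rotation) still needs a manual check.
    ffmpeg -i portrait.png -vf "crop='min(iw,ih)':'min(iw,ih)'" source_image.jpg

    # Inference with per-aspect animation weights (flag names assumed).
    python scripts/inference.py \
      --source_image source_image.jpg \
      --driving_audio driving_audio.wav \
      --pose_weight 1.0 --face_weight 1.0 --lip_weight 1.0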

Maintenance & Community

The project has seen significant community engagement, with community-developed resources such as a WebUI, Windows support, and Docker images. The roadmap indicates ongoing work on improving Mandarin Chinese support.

Licensing & Compatibility

The repository does not explicitly state a license, but it acknowledges contributions from other repositories that may carry their own licenses. Users should verify licensing before commercial use.

Limitations & Caveats

The driving audio must be in English due to training data limitations. There is an open bug in which input volume affects inference results (audio normalization); a workaround sketch follows. The project is actively developed, with some enhancements still in progress.
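Until the normalization issue is resolved upstream, one workaround is to loudness-normalize the driving audio yourself before inference, for example with ffmpeg's EBU R128 loudnorm filter. The target values below are common defaults, not values prescribed by the hallo project.

    # Normalize perceived loudness; the I (integrated loudness),
    # TP (true peak), and LRA (loudness range) targets are illustrative.
    ffmpeg -i driving_audio.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 normalized_audio.wav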

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 161 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

Audio processing and generation research project

Top 0.1% on sourcepulse, 10k stars, created 2 years ago, updated 1 year ago