hallo by fudan-generative-vision

Audio-driven visual synthesis for portrait image animation

created 1 year ago
8,539 stars

Top 6.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Hallo addresses the challenge of animating portrait images based on audio input, enabling realistic lip-sync and head movements. It is designed for researchers and developers interested in generative AI, computer vision, and audio-visual synthesis, offering a powerful tool for creating dynamic visual content from static images and speech.

How It Works

Hallo employs a hierarchical approach to audio-driven visual synthesis. It leverages pre-trained models for face analysis, audio separation, and motion generation, and integrates Stable Diffusion for visual synthesis. The system processes a source image and driving audio, extracting key features such as facial landmarks and motion cues to animate the portrait. This layered structure allows fine-grained control over different aspects of the animation, yielding more natural and expressive results.

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment (conda create -n hallo python=3.10), activate it (conda activate hallo), install the requirements (pip install -r requirements.txt), then install the package (pip install .). FFmpeg is also required (apt-get install ffmpeg). The full sequence is sketched after this list.
  • Prerequisites: Ubuntu 20.04/22.04, CUDA 12.1; tested GPUs include the A100.
  • Pretrained Models: Download required models from HuggingFace (git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models).
  • Usage: Run inference via python scripts/inference.py --source_image <image_path> --driving_audio <audio_path>.
  • Resources: showcase video (https://github.com/fudan-generative-vision/hallo/assets/17402682/9d1a0de4-3470-4d38-9e4f-412f517f834c), Hugging Face demo (https://huggingface.co/spaces/fudan-generative-vision/hallo).
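Taken together, the quick-start steps above amount to the following shell session. This is a sketch: the clone URL is inferred from the project's GitHub path, and the example image and audio paths are placeholders.

    # Clone the repository and enter it.
    git clone https://github.com/fudan-generative-vision/hallo
    cd hallo

    # Create and activate the conda environment (Python 3.10).
    conda create -n hallo python=3.10
    conda activate hallo

    # Install dependencies, then the package itself.
    pip install -r requirements.txt
    pip install .

    # FFmpeg is required for audio/video handling.
    apt-get install ffmpeg

    # Fetch the pretrained models from Hugging Face.
    git clone https://huggingface.co/fudan-generative-ai/hallo pretrained_models

    # Run inference on a source portrait and a driving audio clip
    # (placeholder paths -- substitute your own files).
    python scripts/inference.py --source_image examples/source.jpg --driving_audio examples/audio.wav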

Highlighted Details

  • Supports training custom data with released training code.
  • Community contributions include Windows version, ComfyUI, WebUI, and Docker templates.
  • Input requirements: square-cropped source images with the face occupying 50-70% of the frame and rotated less than 30°; driving audio in WAV format, in English, with clear vocals. A preprocessing sketch follows this list.
  • Offers fine-grained control over pose, face, and lip animation weights.
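The input requirements and weight controls above can be exercised from the shell. This is a hedged sketch: the ffmpeg commands are ordinary conversions toward the stated constraints, and the weight flag names (--pose_weight, --face_weight, --lip_weight) are assumed from the project README, so verify them against scripts/inference.py.

    # Convert driving audio to mono WAV (16 kHz is a common choice;
    # check the repository docs for the expected sample rate).
    ffmpeg -i speech.mp3 -ac 1 -ar 16000 driving_audio.wav

    # Center-crop the source image to a square; the face framing
    # (50-70% of frame, <30 degree rotation) still needs a manual check.
    ffmpeg -i portrait.png -vf "crop='min(iw,ih)':'min(iw,ih)'" source_image.jpg

    # Inference with per-aspect animation weights (flag names assumed).
    python scripts/inference.py \
      --source_image source_image.jpg \
      --driving_audio driving_audio.wav \
      --pose_weight 1.0 --face_weight 1.0 --lip_weight 1.0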

Maintenance & Community

The project has seen significant community engagement, with community-developed resources such as a WebUI, Windows support, and Docker images. The roadmap indicates ongoing work on improving Mandarin Chinese support.

Licensing & Compatibility

The repository does not explicitly state a license, but it acknowledges contributions from other repositories that may carry their own licenses. Users should verify licensing before commercial use.

Limitations & Caveats

The driving audio must be in English due to training data limitations. There is an open bug in which input volume affects inference results (audio normalization); a workaround sketch follows. The project is actively developed, with some enhancements still in progress.
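Until the normalization issue is resolved upstream, one workaround is to loudness-normalize the driving audio yourself before inference, for example with ffmpeg's EBU R128 loudnorm filter. The target values below are common defaults, not values prescribed by the hallo project.

    # Normalize perceived loudness; the I (integrated loudness),
    # TP (true peak), and LRA (loudness range) targets are illustrative.
    ffmpeg -i driving_audio.wav -af loudnorm=I=-16:TP=-1.5:LRA=11 normalized_audio.wav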

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 161 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

AudioGPT by AIGC-Audio

Audio processing and generation research project

Top 0.1% on sourcepulse, 10k stars, created 2 years ago, updated 1 year ago