hallo2 by fudan-generative-vision

Audio-driven portrait animation for long durations and high resolutions

created 9 months ago
3,604 stars

Top 13.7% on sourcepulse

Project Summary

Hallo2 is an open-source project for generating long-duration, high-resolution portrait animations driven by audio. It targets researchers and developers in AI-driven media synthesis, offering a solution for creating realistic talking head videos from static images and audio inputs.

How It Works

Hallo2 employs a diffusion-based approach, leveraging a UNet architecture for denoising. It integrates multiple specialized models for face analysis, motion generation, and audio processing. The system processes input images and audio to generate synchronized facial movements and expressions, with an optional super-resolution module for enhanced output quality.
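
To make the denoising idea concrete, below is a minimal sketch of how an audio-conditioned diffusion denoising loop is typically structured. The module names, shapes, and update rule are illustrative placeholders only, not Hallo2's actual architecture or API; the real pipeline additionally wires in its face-analysis, motion-generation, and super-resolution models.

```python
# Illustrative sketch of an audio-conditioned diffusion denoising loop.
# All names, shapes, and the update rule are placeholders, not Hallo2's code.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in for the denoising UNet; the real model is far larger."""
    def __init__(self, channels=4, audio_dim=32):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, channels)
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latents, t, audio_emb):
        # Condition the latents on the audio embedding (broadcast over H, W).
        cond = self.audio_proj(audio_emb)[:, :, None, None]
        return self.net(latents + cond)

@torch.no_grad()
def denoise(unet, audio_emb, steps=25, size=(1, 4, 64, 64)):
    """Iteratively denoise from Gaussian noise, conditioned on audio."""
    latents = torch.randn(size)
    for t in reversed(range(steps)):
        noise_pred = unet(latents, t, audio_emb)
        latents = latents - noise_pred / steps  # placeholder update rule
    return latents  # in a real pipeline, decoded to video frames by a VAE

unet = TinyUNet()
frames_latent = denoise(unet, audio_emb=torch.randn(1, 32))
print(frames_latent.shape)
```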

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n hallo python=3.10), activate it, install PyTorch (pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118), install requirements (pip install -r requirements.txt), and install ffmpeg (apt-get install ffmpeg).
  • Pretrained Models: Download from HuggingFace (huggingface-cli download fudan-generative-ai/hallo2 --local-dir ./pretrained_models).
  • System Requirements: Ubuntu 20.04/22.04, CUDA 11.8, tested on A100 GPUs.
  • Input Data: Square-cropped images (face 50-70% of image, <30° rotation), WAV audio (English, clear vocals).
  • Inference: python scripts/inference_long.py --config ./configs/inference/long.yaml for long-duration animation; python scripts/video_sr.py --input_path [input_video] --output_path [output_dir] for high-resolution output (see the sketch after this list).
  • Links: Project page (not provided), Paper (arXiv:2410.07718).
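
The following is a hypothetical batch driver that chains the two documented inference commands. The script paths and flags come from the repository's usage above; the config path, output video location, and output directory are assumptions made for the example.

```python
# Hypothetical driver chaining long-duration inference with super-resolution.
# Script names and flags follow the documented commands; paths are assumed.
import subprocess
from pathlib import Path

def animate_and_upscale(config="./configs/inference/long.yaml",
                        raw_video="./output/animation.mp4",
                        upscaled_dir="./output/4k"):
    # Step 1: long-duration, audio-driven portrait animation.
    subprocess.run(
        ["python", "scripts/inference_long.py", "--config", config],
        check=True,
    )
    # Step 2: optional super-resolution pass on the generated video.
    Path(upscaled_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["python", "scripts/video_sr.py",
         "--input_path", raw_video,
         "--output_path", upscaled_dir],
        check=True,
    )

if __name__ == "__main__":
    animate_and_upscale()
```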

Highlighted Details

  • Supports long-duration (up to 1 hour) and high-resolution (4K) video generation.
  • Demonstrates its capabilities with demo speeches from Taylor Swift, Johan Rockström, and Churchill.
  • The roadmap lists paper submission and code release as completed milestones.
  • Offers training scripts for both long-duration and high-resolution animation.

Maintenance & Community

  • Paper accepted to ICLR 2025.
  • Source code and pretrained weights released October 2024.
  • Open research positions available at Fudan University.

Licensing & Compatibility

  • The high-resolution animation feature is under the S-Lab License 1.0. Other components' licenses are not explicitly stated but appear to be permissive, given the open-source release.

Limitations & Caveats

  • Audio driving is limited to English due to training data constraints.
  • The project acknowledges social risks related to deepfakes and privacy, emphasizing ethical guidelines.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 60 stars in the last 90 days
