hallo3  by fudan-generative-vision

Research paper for portrait image animation via video diffusion transformer

created 8 months ago
1,284 stars


Project Summary

Hallo3 animates a static portrait image from speech audio, producing highly dynamic and realistic video. It is aimed at researchers and developers in generative AI and computer vision, and it uses a Video Diffusion Transformer architecture to achieve state-of-the-art results in audio-driven portrait animation.

How It Works

Hallo3 uses a Video Diffusion Transformer (VDT) model built on the CogVideo-5B I2V architecture. From a single reference image and an audio input, it generates high-fidelity, temporally coherent video. The transformer backbone captures long-range dependencies across frames, which matters for realistic motion and expression synthesis, while the diffusion process is responsible for the visual quality of the output.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n hallo python=3.10), activate it, and install the requirements (pip install -r requirements.txt). ffmpeg is also required (apt-get install ffmpeg).
  • Pretrained Models: Download from HuggingFace (huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models). Inference requires models for audio separation, text encoding, face analysis, and the core VDT.
  • Inference Data: Reference image (1:1 or 3:2 aspect ratio), WAV audio (English, clear vocals).
  • Demo: Gradio UI via python hallo3/app.py.
  • Docs: Project page linked in README.
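
The input requirements listed above (image aspect ratio and WAV audio) can be checked before running inference. The following is a minimal sketch, not part of Hallo3: the function names, the 1% tolerance, and the reading of "1:1 or 3:2" as width:height are all assumptions made for illustration. It uses only the Python standard library, so the image dimensions must be supplied by the caller, and the wave module only recognizes PCM WAV files.

```python
import wave
from fractions import Fraction

# Aspect ratios named in the requirements, read here as width:height.
# The orientation is an assumption; the summary does not specify it.
ALLOWED_RATIOS = (Fraction(1, 1), Fraction(3, 2))


def has_allowed_aspect_ratio(width: int, height: int, tolerance: float = 0.01) -> bool:
    """Return True if width/height is within a relative tolerance of an allowed ratio."""
    if width <= 0 or height <= 0:
        return False
    ratio = width / height
    return any(abs(ratio - float(r)) <= tolerance * float(r) for r in ALLOWED_RATIOS)


def is_readable_wav(path: str) -> bool:
    """Return True if the standard-library wave module can open the file and it has frames."""
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # Not a PCM WAV file, truncated, or missing.
        return False
```

A caller would obtain the image width and height from whatever image library is already in use and pass them to has_allowed_aspect_ratio. Neither check covers the language or vocal clarity of the audio, which still need to be verified by listening.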

Highlighted Details

  • Accepted to CVPR 2025.
  • Released over 70 hours of talking-head videos and 50 hours of dynamic clips as training data.
  • Fine-tuned derivative of CogVideo-5B I2V model.
  • Supports batch inference via provided scripts.

Maintenance & Community

  • Developed by Fudan University and Baidu Inc.
  • No explicit community links (Discord/Slack) or roadmap provided in the README.

Licensing & Compatibility

  • The project is a derivative work of CogVideo-5B, which is open-source. The use, distribution, and modification of Hallo3 must comply with the CogVideo-5B LICENSE. Specific terms of the CogVideo-5B license are not detailed in this README.

Limitations & Caveats

  • Audio input is restricted to English due to training data limitations.
  • Potential social risks related to deepfakes and misuse are acknowledged, with a call for ethical guidelines and responsible use.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 91 stars in the last 90 days
