Research paper and code for portrait image animation via a video diffusion transformer
Hallo3 enables highly dynamic and realistic portrait image animation driven by audio, targeting researchers and developers in generative AI and computer vision. It leverages a Video Diffusion Transformer architecture to achieve state-of-the-art results in animating static portraits based on speech.
How It Works
Hallo3 utilizes a Video Diffusion Transformer (VDT), building upon the CogVideoX-5B image-to-video (I2V) architecture. This design generates high-fidelity, temporally coherent video from a single portrait image and an audio track. The transformer backbone captures long-range spatio-temporal dependencies, which is crucial for realistic motion and expression synthesis, while the iterative diffusion process ensures high visual quality.
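The generation loop described above can be illustrated with a toy scalar analogue: frames start as pure noise and are iteratively denoised toward a target combining the reference image latent, an audio-driven offset, and neighbor smoothing (standing in for the transformer's temporal attention). All names, coefficients, and the update rule here are illustrative assumptions, not the actual Hallo3 implementation:

```python
import random

def toy_denoise_video(ref_latent, audio_feats, steps=15, seed=0):
    """Toy sketch of audio-conditioned latent video diffusion.

    ref_latent: scalar stand-in for the encoded reference portrait.
    audio_feats: one conditioning scalar per output frame.
    Returns one denoised scalar latent per frame.
    """
    rng = random.Random(seed)
    # Diffusion sampling starts every frame from pure Gaussian noise.
    frames = [rng.gauss(0.0, 1.0) for _ in audio_feats]
    for _ in range(steps):
        new_frames = []
        for i, x in enumerate(frames):
            left = frames[i - 1] if i > 0 else x
            right = frames[i + 1] if i < len(frames) - 1 else x
            # "Denoiser" target: reference latent plus an audio-driven
            # offset, with neighbor smoothing as a crude stand-in for
            # the VDT's long-range temporal attention.
            target = (ref_latent
                      + 0.1 * audio_feats[i]
                      + 0.2 * (left + right - 2 * x))
            # Blend toward the target, shrinking the noise each step.
            new_frames.append(0.7 * x + 0.3 * target)
        frames = new_frames
    return frames
```

With silent audio (all-zero features), every frame converges to the reference latent; varying the audio features perturbs each frame while the smoothing term keeps adjacent frames coherent.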
Quick Start & Requirements
Create a conda environment (conda create -n hallo python=3.10), activate it, and install the Python requirements (pip install -r requirements.txt). ffmpeg is also required (e.g. apt-get install ffmpeg). Download the pretrained weights with huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models; this includes models for audio separation, text encoding, face analysis, and the core VDT. Launch the demo app with python hallo3/app.py.
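The setup steps above can be collected into one shell session (commands and paths as given in this section; the apt-get line assumes a Debian/Ubuntu system and may need sudo or a different package manager elsewhere):

```shell
# Create and activate the environment (Python 3.10, per the instructions above).
conda create -n hallo python=3.10
conda activate hallo

# Install Python dependencies and ffmpeg (Debian/Ubuntu shown).
pip install -r requirements.txt
apt-get install ffmpeg

# Fetch the pretrained models (audio separation, text encoding,
# face analysis, and the core video diffusion transformer).
huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models

# Launch the demo app.
python hallo3/app.py
```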
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats