Talking face animation via decoupled motion encoding (ACM MM 2024 paper)
Top 26.9% on sourcepulse
AniTalker provides an open-source implementation for animating talking faces from a single image and audio input, based on the ACM MM 2024 paper "AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding." It targets researchers and developers in computer vision and graphics looking to create realistic and controllable facial animations.
How It Works
AniTalker employs a two-stage approach. The first stage trains a motion encoder and image renderer that transfer identity-decoupled facial motion onto a single portrait image. The second stage maps HuBERT or MFCC audio features to motion, supporting audio-only animation or additional control via head pose and face location/scale. The HuBERT-based models are recommended for better expressiveness and results.
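A minimal sketch of that two-stage inference flow is shown below; the names (motion_encoder, renderer, audio2motion, encode_identity) are illustrative assumptions, not the repository's actual API.
# Illustrative two-stage inference flow; all names here are hypothetical,
# not AniTalker's real classes or functions.
import numpy as np

def animate(portrait, audio_features, motion_encoder, renderer, audio2motion,
            pose_sequence=None):
    # Stage 1 components (motion encoder + image renderer) learn to transfer
    # facial motion onto a source image while keeping identity decoupled.
    identity_feature = motion_encoder.encode_identity(portrait)

    # Stage 2: map HuBERT (or MFCC) audio features to a sequence of motion
    # latents, optionally conditioned on head pose and face location/scale.
    motion_latents = audio2motion(audio_features, pose=pose_sequence)

    # Render one frame per motion latent from the single source portrait.
    frames = [renderer(portrait, identity_feature, latent)
              for latent in motion_latents]
    return np.stack(frames)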
Quick Start & Requirements
Create the conda environment and install the dependencies:
conda create -n anitalker python==3.9.0
conda activate anitalker
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements.txt
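A quick sanity check after installation (not part of the repo's instructions) is to confirm the expected PyTorch build and CUDA availability:
# Environment sanity check; versions should match those installed above.
import torch, torchvision, torchaudio
print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())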
transformers==4.19.2 is also required, and face super-resolution requires facexlib and tb-nightly. Place the pretrained checkpoints in the ckpts/ directory, then run the HuBERT audio-only demo:
python ./code/demo.py --infer_type 'hubert_audio_only' --stage1_checkpoint_path 'ckpts/stage1.ckpt' --stage2_checkpoint_path 'ckpts/stage2_audio_only_hubert.ckpt' --test_image_path 'test_demos/portraits/monalisa.jpg' --test_audio_path 'test_demos/audios/monalisa.wav' --test_hubert_path 'test_demos/audios_hubert/monalisa.npy' --result_path 'outputs/monalisa_hubert/'
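To run the same checkpoints over several portrait/audio pairs, the demo script can simply be invoked in a loop; a minimal sketch, assuming you substitute your own file paths:
# Batch inference by reusing the demo CLI above.
# The (image, audio, hubert, output) paths below are placeholders.
import subprocess

pairs = [
    ("test_demos/portraits/monalisa.jpg",
     "test_demos/audios/monalisa.wav",
     "test_demos/audios_hubert/monalisa.npy",
     "outputs/monalisa_hubert/"),
    # add more (image, audio, hubert, output_dir) tuples here
]

for image, audio, hubert, out_dir in pairs:
    subprocess.run([
        "python", "./code/demo.py",
        "--infer_type", "hubert_audio_only",
        "--stage1_checkpoint_path", "ckpts/stage1.ckpt",
        "--stage2_checkpoint_path", "ckpts/stage2_audio_only_hubert.ckpt",
        "--test_image_path", image,
        "--test_audio_path", audio,
        "--test_hubert_path", hubert,
        "--result_path", out_dir,
    ], check=True)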
Highlighted Details
Maintenance & Community
The project has recently added Windows and macOS tutorials and a Hugging Face Space. Community contributions are welcomed for tasks such as UI development and translation.
Licensing & Compatibility
The repository is released under a permissive license, but the disclaimer notes it is not a formal product and is intended for academic demonstration. Use for spreading harmful information is strictly prohibited, and users are responsible for compliance with local laws.
Limitations & Caveats
The model is primarily trained on English speech and may exhibit issues with other languages. It is trained on frontal faces and may struggle with dramatic pose changes or non-speech scenarios. The dataset has demographic biases, and the model cannot effectively isolate accessories or hairstyles from the background. The authors have chosen not to release training scripts or additional checkpoints due to ethical concerns.