Talking face animation via decoupled motion encoding (ACM MM 2024 paper)
Top 26.9% on sourcepulse
AniTalker provides an open-source implementation for animating talking faces from a single image and audio input, based on the ACM MM 2024 paper "AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding." It targets researchers and developers in computer vision and graphics looking to create realistic and controllable facial animations.
How It Works
AniTalker employs a two-stage approach. The first stage trains a motion encoder and image renderer that transfer identity-decoupled facial motion onto a single portrait image. The second stage maps HuBERT or MFCC audio features to motion, supporting audio-only animation or additional control via head pose and face location/scale. The HuBERT-based models are recommended for better expressiveness and results.
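A minimal sketch of that two-stage inference flow is shown below; the names (motion_encoder, renderer, audio2motion, encode_identity) are illustrative assumptions, not the repository's actual API.
# Illustrative two-stage inference flow; all names here are hypothetical,
# not AniTalker's real classes or functions.
import numpy as np

def animate(portrait, audio_features, motion_encoder, renderer, audio2motion,
            pose_sequence=None):
    # Stage 1 components (motion encoder + image renderer) learn to transfer
    # facial motion onto a source image while keeping identity decoupled.
    identity_feature = motion_encoder.encode_identity(portrait)

    # Stage 2: map HuBERT (or MFCC) audio features to a sequence of motion
    # latents, optionally conditioned on head pose and face location/scale.
    motion_latents = audio2motion(audio_features, pose=pose_sequence)

    # Render one frame per motion latent from the single source portrait.
    frames = [renderer(portrait, identity_feature, latent)
              for latent in motion_latents]
    return np.stack(frames)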
Quick Start & Requirements
Create the conda environment and install the dependencies:
conda create -n anitalker python==3.9.0
conda activate anitalker
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install -r requirements.txt
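A quick sanity check after installation (not part of the repo's instructions) is to confirm the expected PyTorch build and CUDA availability:
# Environment sanity check; versions should match those installed above.
import torch, torchvision, torchaudio
print(torch.__version__, torchvision.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())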
transformers==4.19.2 is also required, and face super-resolution requires facexlib and tb-nightly. Place the pretrained checkpoints in the ckpts/ directory, then run the HuBERT audio-only demo:
python ./code/demo.py --infer_type 'hubert_audio_only' --stage1_checkpoint_path 'ckpts/stage1.ckpt' --stage2_checkpoint_path 'ckpts/stage2_audio_only_hubert.ckpt' --test_image_path 'test_demos/portraits/monalisa.jpg' --test_audio_path 'test_demos/audios/monalisa.wav' --test_hubert_path 'test_demos/audios_hubert/monalisa.npy' --result_path 'outputs/monalisa_hubert/'
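To run the same checkpoints over several portrait/audio pairs, the demo script can simply be invoked in a loop; a minimal sketch, assuming you substitute your own file paths:
# Batch inference by reusing the demo CLI above.
# The (image, audio, hubert, output) paths below are placeholders.
import subprocess

pairs = [
    ("test_demos/portraits/monalisa.jpg",
     "test_demos/audios/monalisa.wav",
     "test_demos/audios_hubert/monalisa.npy",
     "outputs/monalisa_hubert/"),
    # add more (image, audio, hubert, output_dir) tuples here
]

for image, audio, hubert, out_dir in pairs:
    subprocess.run([
        "python", "./code/demo.py",
        "--infer_type", "hubert_audio_only",
        "--stage1_checkpoint_path", "ckpts/stage1.ckpt",
        "--stage2_checkpoint_path", "ckpts/stage2_audio_only_hubert.ckpt",
        "--test_image_path", image,
        "--test_audio_path", audio,
        "--test_hubert_path", hubert,
        "--result_path", out_dir,
    ], check=True)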
Highlighted Details
Maintenance & Community
The project has recently added Windows and macOS tutorials and a Hugging Face Space. Community contributions are welcomed for tasks such as UI development and translation.
Licensing & Compatibility
The repository is released under a permissive license, but the disclaimer notes it is not a formal product and is intended for academic demonstration. Use for spreading harmful information is strictly prohibited, and users are responsible for compliance with local laws.
Limitations & Caveats
The model is primarily trained on English speech and may exhibit issues with other languages. It is trained on frontal faces and may struggle with dramatic pose changes or non-speech scenarios. The dataset has demographic biases, and the model cannot effectively isolate accessories or hairstyles from the background. The authors have chosen not to release training scripts or additional checkpoints due to ethical concerns.