Talking head synthesis research paper (CVPR 2024)
SyncTalk synthesizes synchronized talking head videos, focusing on precise lip-sync and stable head poses. It targets researchers and developers in computer vision and graphics, offering high-resolution video generation with restored facial details.
How It Works
SyncTalk employs tri-plane hash representations to preserve subject identity. It uses an Audio-Visual Encoder (AVE), or alternative audio feature extractors (DeepSpeech, HuBERT), to extract speech features for lip synchronization. The system generates synchronized lip movements, facial expressions, and stable head poses, with an optional torso training module that addresses artifacts such as double chins.
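The tri-plane idea can be sketched as follows: each 3D query point is projected onto three axis-aligned feature planes, the per-plane features are sampled and fused, and a small MLP decodes them; conditioning that decoder on the audio features is what ties lip shape to speech. The sketch below is illustrative only and assumes plain learnable planes; the class and parameter names (TriPlaneField, resolution, channels) are hypothetical, and SyncTalk's actual field uses hash-encoded planes with its own conditioning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneField(nn.Module):
    """Illustrative tri-plane field (not SyncTalk's implementation):
    project a 3D point onto the XY, XZ, and YZ planes, bilinearly sample
    a feature from each plane, concatenate, and decode with a small MLP."""

    def __init__(self, resolution: int = 128, channels: int = 16, out_dim: int = 4):
        super().__init__()
        # One learnable 2D feature grid per plane (SyncTalk uses hash grids instead).
        self.planes = nn.Parameter(torch.randn(3, channels, resolution, resolution) * 0.01)
        self.decoder = nn.Sequential(
            nn.Linear(3 * channels, 64), nn.ReLU(),
            nn.Linear(64, out_dim),  # e.g. color + density
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) query points normalized to [-1, 1]
        projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, projections):
            grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2)
            sampled = F.grid_sample(plane[None], grid,         # (1, C, N, 1)
                                    align_corners=True)
            feats.append(sampled.squeeze(0).squeeze(-1).t())   # (N, C)
        return self.decoder(torch.cat(feats, dim=-1))

points = torch.rand(1024, 3) * 2 - 1   # random query points in [-1, 1]^3
out = TriPlaneField()(points)          # (1024, 4)
```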
Quick Start & Requirements
Create a Conda environment (python==3.8.8) and install dependencies (torch==1.12.1+cu113, tensorflow-gpu==2.8.1, pytorch3d, requirements.txt). Download the provided sample data and pre-trained checkpoints (May.zip, trial_may.zip) and the face/3DMM models. Video processing requires 25 FPS, ~512x512 resolution, 4-5 minute videos.
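As a minimal, hedged sanity check (not part of the SyncTalk repository), the video constraints can be verified with OpenCV before preprocessing; the file path below is a placeholder.

```python
import cv2

def check_video(path: str) -> None:
    """Rough check that a training video matches the stated requirements:
    25 FPS, ~512x512 resolution, and 4-5 minutes of footage."""
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"cannot open {path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    minutes = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps / 60 if fps else 0.0
    cap.release()

    print(f"{path}: {width}x{height} @ {fps:.2f} FPS, {minutes:.1f} min")
    assert abs(fps - 25) < 0.5, "video should be 25 FPS"
    assert 4 <= minutes <= 5, "video should be roughly 4-5 minutes long"
    assert abs(width - 512) <= 64 and abs(height - 512) <= 64, "frames should be ~512x512"

check_video("data/obama.mp4")  # placeholder path
```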
Maintenance & Community
The project is associated with CVPR 2024. Recent updates include bug fixes for the audio encoder, blendshape capture, and face tracker, along with Windows support and torso training.
Licensing & Compatibility
The repository does not explicitly state a license. The code is heavily reliant on other projects, some of which have permissive licenses (e.g., MIT). Users should verify licensing for commercial use.
Limitations & Caveats
The README notes that EmoTalk's blendshape capture is not open-source, and the provided mediapipe alternative may not perform as well. Torso training is incompatible with the --portrait mode.