Talking head synthesis research paper (CVPR 2024)
SyncTalk synthesizes synchronized talking head videos, focusing on precise lip-sync and stable head poses. It targets researchers and developers in computer vision and graphics, offering high-resolution video generation with restored facial details.
How It Works
SyncTalk employs tri-plane hash representations to preserve subject identity. It uses an Audio-Visual Encoder (AVE), or alternative audio feature extractors (DeepSpeech, HuBERT), to extract speech features for lip synchronization. The system generates synchronized lip movements, facial expressions, and stable head poses, with an optional torso training module that addresses artifacts such as double chins.
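The tri-plane idea can be sketched as follows: each 3D query point is projected onto three axis-aligned feature planes, the per-plane features are sampled and fused, and a small MLP decodes them; conditioning that decoder on the audio features is what ties lip shape to speech. The sketch below is illustrative only and assumes plain learnable planes; the class and parameter names (TriPlaneField, resolution, channels) are hypothetical, and SyncTalk's actual field uses hash-encoded planes with its own conditioning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneField(nn.Module):
    """Illustrative tri-plane field (not SyncTalk's implementation):
    project a 3D point onto the XY, XZ, and YZ planes, bilinearly sample
    a feature from each plane, concatenate, and decode with a small MLP."""

    def __init__(self, resolution: int = 128, channels: int = 16, out_dim: int = 4):
        super().__init__()
        # One learnable 2D feature grid per plane (SyncTalk uses hash grids instead).
        self.planes = nn.Parameter(torch.randn(3, channels, resolution, resolution) * 0.01)
        self.decoder = nn.Sequential(
            nn.Linear(3 * channels, 64), nn.ReLU(),
            nn.Linear(64, out_dim),  # e.g. color + density
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) query points normalized to [-1, 1]
        projections = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, projections):
            grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2)
            sampled = F.grid_sample(plane[None], grid,         # (1, C, N, 1)
                                    align_corners=True)
            feats.append(sampled.squeeze(0).squeeze(-1).t())   # (N, C)
        return self.decoder(torch.cat(feats, dim=-1))

points = torch.rand(1024, 3) * 2 - 1   # random query points in [-1, 1]^3
out = TriPlaneField()(points)          # (1024, 4)
```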
Quick Start & Requirements
Create a Conda environment (python==3.8.8) and install dependencies (torch==1.12.1+cu113, tensorflow-gpu==2.8.1, pytorch3d, requirements.txt). Download the provided sample data and pre-trained checkpoints (May.zip, trial_may.zip) and the face/3DMM models. Video processing requires 25 FPS, ~512x512 resolution, 4-5 minute videos.
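As a minimal, hedged sanity check (not part of the SyncTalk repository), the video constraints can be verified with OpenCV before preprocessing; the file path below is a placeholder.

```python
import cv2

def check_video(path: str) -> None:
    """Rough check that a training video matches the stated requirements:
    25 FPS, ~512x512 resolution, and 4-5 minutes of footage."""
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"cannot open {path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    minutes = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps / 60 if fps else 0.0
    cap.release()

    print(f"{path}: {width}x{height} @ {fps:.2f} FPS, {minutes:.1f} min")
    assert abs(fps - 25) < 0.5, "video should be 25 FPS"
    assert 4 <= minutes <= 5, "video should be roughly 4-5 minutes long"
    assert abs(width - 512) <= 64 and abs(height - 512) <= 64, "frames should be ~512x512"

check_video("data/obama.mp4")  # placeholder path
```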
Maintenance & Community
The project is associated with CVPR 2024. Recent updates include bug fixes for the audio encoder, blendshape capture, and face tracker, along with Windows support and torso training.
Licensing & Compatibility
The repository does not explicitly state a license. The code is heavily reliant on other projects, some of which have permissive licenses (e.g., MIT). Users should verify licensing for commercial use.
Limitations & Caveats
The README notes that EmoTalk's blendshape capture is not open-source, and the provided mediapipe alternative may not perform as well. Torso training is incompatible with the --portrait mode.