Talking-head video synthesis via phoneme-pose dictionary (ICASSP 2022 paper)
This repository provides code for Text2Video, a method for synthesizing talking-head videos from text. It is designed for researchers and developers in computer vision and graphics interested in novel video generation techniques. The approach offers advantages over audio-driven methods by requiring less training data, exhibiting greater flexibility, and reducing preprocessing and inference times.
How It Works
The core of the method involves building a phoneme-pose dictionary and training a Generative Adversarial Network (GAN) to synthesize video from interpolated phoneme poses. This text-driven approach leverages a phonetic dictionary to map text to corresponding facial poses, enabling more direct control and efficiency compared to methods relying solely on audio input.
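As a rough illustration of the idea (not the repository's actual interface), the sketch below looks up a key facial pose for each phoneme in a dictionary and linearly interpolates between consecutive key poses; the dictionary contents, landmark-based pose format, and the function name `interpolate_poses` are assumptions made here for clarity.

```python
# Illustrative sketch only: the pose representation (68 2-D facial landmarks),
# frames-per-phoneme rate, and all names are assumptions, not the repo's API.
import numpy as np

def interpolate_poses(phonemes, phoneme_pose_dict, frames_per_phoneme=5):
    """Map each phoneme to a key pose, then linearly blend between
    consecutive key poses to get a smooth per-frame pose sequence."""
    key_poses = [phoneme_pose_dict[p] for p in phonemes]
    frames = []
    for start, end in zip(key_poses[:-1], key_poses[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_phoneme, endpoint=False):
            frames.append((1.0 - t) * start + t * end)  # linear blend of landmark arrays
    frames.append(key_poses[-1])
    return np.stack(frames)  # shape: (num_frames, num_landmarks * 2)

# Two hypothetical phoneme entries, each a flattened array of 68 landmarks.
pose_dict = {"AA": np.zeros(68 * 2), "M": np.ones(68 * 2)}
pose_sequence = interpolate_poses(["AA", "M"], pose_dict)
# In the actual method, a pose sequence like this conditions the GAN (vid2vid) generator.
```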
Quick Start & Requirements
Dependencies include sox, libsox-fmt-mp3, zhon, moviepy, ffmpeg, dominate, pydub, and vosk (with a language model). The quick-start steps additionally use git, sox, ffmpeg, vosk (for Chinese), and potentially montreal-forced-aligner. CUDA-enabled GPUs are recommended for training and inference.
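A quick environment check along the lines of the following (a hypothetical helper, not part of the repository) can confirm that the command-line tools and Python packages listed above are available before running the pipeline:

```python
# Hypothetical pre-flight check for the dependencies listed above; the exact set
# of requirements depends on which pipeline (English/Chinese, training/inference) is run.
import importlib
import shutil

CLI_TOOLS = ["git", "sox", "ffmpeg"]
PYTHON_PACKAGES = ["zhon", "moviepy", "dominate", "pydub", "vosk"]

missing = [tool for tool in CLI_TOOLS if shutil.which(tool) is None]
for pkg in PYTHON_PACKAGES:
    try:
        importlib.import_module(pkg)
    except ImportError:
        missing.append(pkg)

print("Missing dependencies:", ", ".join(missing) if missing else "none")
```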
Highlighted Details
Maintenance & Community
The project is associated with ICASSP 2022. No specific community channels or active maintenance indicators are present in the README.
Licensing & Compatibility
The repository is released under a permissive license, but it is based on the NVIDIA vid2vid framework, which may have its own licensing terms. Compatibility for commercial use should be verified against the underlying vid2vid license.
Limitations & Caveats
The setup process involves multiple external dependencies and pre-trained model downloads. Training from scratch requires significant data preparation and computational resources. The project is research-oriented, and production-readiness is not explicitly stated.