Talking-head video synthesis via phoneme-pose dictionary (ICASSP 2022 paper)
This repository provides code for Text2Video, a method for synthesizing talking-head videos from text. It is designed for researchers and developers in computer vision and graphics interested in novel video generation techniques. The approach offers advantages over audio-driven methods by requiring less training data, exhibiting greater flexibility, and reducing preprocessing and inference times.
How It Works
The core of the method involves building a phoneme-pose dictionary and training a Generative Adversarial Network (GAN) to synthesize video from interpolated phoneme poses. This text-driven approach leverages a phonetic dictionary to map text to corresponding facial poses, enabling more direct control and efficiency compared to methods relying solely on audio input.
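As a rough illustration of the idea (not the repository's actual interface), the sketch below looks up a key facial pose for each phoneme in a dictionary and linearly interpolates between consecutive key poses; the dictionary contents, landmark-based pose format, and the function name `interpolate_poses` are assumptions made here for clarity.

```python
# Illustrative sketch only: the pose representation (68 2-D facial landmarks),
# frames-per-phoneme rate, and all names are assumptions, not the repo's API.
import numpy as np

def interpolate_poses(phonemes, phoneme_pose_dict, frames_per_phoneme=5):
    """Map each phoneme to a key pose, then linearly blend between
    consecutive key poses to get a smooth per-frame pose sequence."""
    key_poses = [phoneme_pose_dict[p] for p in phonemes]
    frames = []
    for start, end in zip(key_poses[:-1], key_poses[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_phoneme, endpoint=False):
            frames.append((1.0 - t) * start + t * end)  # linear blend of landmark arrays
    frames.append(key_poses[-1])
    return np.stack(frames)  # shape: (num_frames, num_landmarks * 2)

# Two hypothetical phoneme entries, each a flattened array of 68 landmarks.
pose_dict = {"AA": np.zeros(68 * 2), "M": np.ones(68 * 2)}
pose_sequence = interpolate_poses(["AA", "M"], pose_dict)
# In the actual method, a pose sequence like this conditions the GAN (vid2vid) generator.
```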
Quick Start & Requirements
Dependencies include sox, libsox-fmt-mp3, zhon, moviepy, ffmpeg, dominate, pydub, and vosk (with a language model). The quick-start steps additionally use git, sox, ffmpeg, vosk (for Chinese), and potentially montreal-forced-aligner. CUDA-enabled GPUs are recommended for training and inference.
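A quick environment check along the lines of the following (a hypothetical helper, not part of the repository) can confirm that the command-line tools and Python packages listed above are available before running the pipeline:

```python
# Hypothetical pre-flight check for the dependencies listed above; the exact set
# of requirements depends on which pipeline (English/Chinese, training/inference) is run.
import importlib
import shutil

CLI_TOOLS = ["git", "sox", "ffmpeg"]
PYTHON_PACKAGES = ["zhon", "moviepy", "dominate", "pydub", "vosk"]

missing = [tool for tool in CLI_TOOLS if shutil.which(tool) is None]
for pkg in PYTHON_PACKAGES:
    try:
        importlib.import_module(pkg)
    except ImportError:
        missing.append(pkg)

print("Missing dependencies:", ", ".join(missing) if missing else "none")
```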
Highlighted Details
Maintenance & Community
The project is associated with ICASSP 2022. No specific community channels or active maintenance indicators are present in the README.
Licensing & Compatibility
The repository is released under a permissive license, but it is based on the NVIDIA vid2vid framework, which may have its own licensing terms. Compatibility for commercial use should be verified against the underlying vid2vid license.
Limitations & Caveats
The setup process involves multiple external dependencies and pre-trained model downloads. Training from scratch requires significant data preparation and computational resources. The project is research-oriented, and production-readiness is not explicitly stated.