Lip2Wav by Rudrabha

Lip-to-speech synthesis for generating speech from lip movements

created 5 years ago
708 stars

Top 49.4% on sourcepulse

Project Summary

This repository provides code for Lip2Wav, a system that generates intelligible speech from lip movements in unconstrained video settings. It is targeted at researchers and developers in audio-visual speech processing and aims to enable high-quality, style-accurate lip-to-speech synthesis.

How It Works

Lip2Wav employs a sequence-to-sequence modeling approach to map visual lip movements directly to speech. It leverages pre-trained models and provides complete training and inference code, allowing for the generation of speech from video inputs. The system is designed to capture individual speaking styles for more accurate synthesis.
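
To make the sequence-to-sequence idea concrete, here is a minimal NumPy sketch of the shape of the problem: an encoder turns a sequence of lip-region frames into feature vectors, and a decoder autoregressively emits mel-spectrogram frames while attending over the encoder outputs. All dimensions, weights, and function names here are illustrative assumptions, not the repository's actual architecture (which uses a Tacotron-style network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
T_vid, H, W = 25, 48, 48      # 1 s of video at 25 fps, cropped lip region
T_mel, n_mels = 80, 80        # target mel-spectrogram frames
d = 64                        # latent feature size

def encode(frames, W_enc):
    """Flatten each lip frame and project it to a feature vector."""
    return frames.reshape(frames.shape[0], -1) @ W_enc   # (T_vid, d)

def decode(enc_out, W_att, W_out, T_out):
    """Autoregressively emit mel frames, attending over encoder outputs."""
    mels = []
    prev = np.zeros(enc_out.shape[1])
    for _ in range(T_out):
        scores = enc_out @ (W_att @ prev)        # attention energies
        weights = np.exp(scores - scores.max())  # softmax over video frames
        weights /= weights.sum()
        context = weights @ enc_out              # attended visual context, (d,)
        mels.append(context @ W_out)             # one mel frame, (n_mels,)
        prev = context
    return np.stack(mels)                        # (T_out, n_mels)

frames = rng.standard_normal((T_vid, H, W))
W_enc = rng.standard_normal((H * W, d)) * 0.01
W_att = rng.standard_normal((d, d))
W_out = rng.standard_normal((d, n_mels)) * 0.1

mel = decode(encode(frames, W_enc), W_att, W_out, T_mel)
print(mel.shape)
```

The key point the sketch captures is the length mismatch the real model must handle: video frames arrive at ~25 fps while mel frames are emitted at ~80 per second, so the decoder runs for more steps than there are input frames and uses attention to pick which visual frames matter at each step.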

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.7.4, ffmpeg, and a downloaded face detection model (s3fd.pth).
  • Setup: Requires downloading speaker-specific pre-trained models and datasets. The download_speaker.sh script can be used to fetch video data. Preprocessing involves running python preprocess.py. Inference is initiated with python complete_test_generate.py.
  • Links: Paper, Project Page, Demo Video
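
Put together, a typical session might look like the following. This is a sketch: the commands come from the bullets above, but the arguments each script takes (speaker names, dataset paths, checkpoints) are placeholders, so consult the repository README for the exact invocations.

```shell
# Clone the repository and install dependencies (Python 3.7.4 recommended)
git clone https://github.com/Rudrabha/Lip2Wav.git
cd Lip2Wav
pip install -r requirements.txt

# Download the s3fd.pth face detection model into the expected folder
# (see the README for the download link and destination path)

# Fetch video data for a speaker
sh download_speaker.sh  # pass the speaker/dataset directory per the README

# Preprocess the downloaded videos, then run inference
python preprocess.py               # speaker root and name per the README
python complete_test_generate.py   # dataset, results dir, and checkpoint per the README
```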

Highlighted Details

  • First work to generate intelligible speech from lip movements in unconstrained settings.
  • Sequence-to-sequence modeling of the lip-to-speech problem.
  • Released dataset for 5 speakers with 100+ hours of video data.
  • Released complete training code and pre-trained models.
  • Code for calculating PESQ, ESTOI, and STOI metrics is available.

Maintenance & Community

The project is associated with CVPR 2020. Further information on community or ongoing maintenance is not detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and closed-source linking.

Limitations & Caveats

The code is tested with Python 3.7.4, and compatibility with newer Python versions is not guaranteed. The README points readers interested in lip-syncing talking face videos to a separate repository, Wav2Lip, suggesting that active development has shifted there.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Top 0.2% on sourcepulse · 6k stars
Text-to-speech model achieving human-level synthesis
created 2 years ago, updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

AudioGPT by AIGC-Audio

Top 0.1% on sourcepulse · 10k stars
Audio processing and generation research project
created 2 years ago, updated 1 year ago