Lip2Wav by Rudrabha

Lip-to-speech synthesis for generating speech from lip movements

created 5 years ago
708 stars

Top 49.4% on sourcepulse

Project Summary

This repository provides code for Lip2Wav, a system that generates intelligible speech from lip movements in unconstrained video settings. It is targeted at researchers and developers in audio-visual speech processing and aims to enable high-quality, style-accurate lip-to-speech synthesis.

How It Works

Lip2Wav employs a sequence-to-sequence modeling approach to map visual lip movements directly to speech. It leverages pre-trained models and provides complete training and inference code, allowing for the generation of speech from video inputs. The system is designed to capture individual speaking styles for more accurate synthesis.
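
To make the sequence-to-sequence idea concrete, here is a minimal NumPy sketch of the shape of the problem: an encoder turns a sequence of lip-region frames into feature vectors, and a decoder autoregressively emits mel-spectrogram frames while attending over the encoder outputs. All dimensions, weights, and function names here are illustrative assumptions, not the repository's actual architecture (which uses a Tacotron-style network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, for illustration only.
T_vid, H, W = 25, 48, 48      # 1 s of video at 25 fps, cropped lip region
T_mel, n_mels = 80, 80        # target mel-spectrogram frames
d = 64                        # latent feature size

def encode(frames, W_enc):
    """Flatten each lip frame and project it to a feature vector."""
    return frames.reshape(frames.shape[0], -1) @ W_enc   # (T_vid, d)

def decode(enc_out, W_att, W_out, T_out):
    """Autoregressively emit mel frames, attending over encoder outputs."""
    mels = []
    prev = np.zeros(enc_out.shape[1])
    for _ in range(T_out):
        scores = enc_out @ (W_att @ prev)        # attention energies
        weights = np.exp(scores - scores.max())  # softmax over video frames
        weights /= weights.sum()
        context = weights @ enc_out              # attended visual context, (d,)
        mels.append(context @ W_out)             # one mel frame, (n_mels,)
        prev = context
    return np.stack(mels)                        # (T_out, n_mels)

frames = rng.standard_normal((T_vid, H, W))
W_enc = rng.standard_normal((H * W, d)) * 0.01
W_att = rng.standard_normal((d, d))
W_out = rng.standard_normal((d, n_mels)) * 0.1

mel = decode(encode(frames, W_enc), W_att, W_out, T_mel)
print(mel.shape)
```

The key point the sketch captures is the length mismatch the real model must handle: video frames arrive at ~25 fps while mel frames are emitted at ~80 per second, so the decoder runs for more steps than there are input frames and uses attention to pick which visual frames matter at each step.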

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.7.4, ffmpeg, and a downloaded face detection model (s3fd.pth).
  • Setup: Requires downloading speaker-specific pre-trained models and datasets. The download_speaker.sh script can be used to fetch video data. Preprocessing involves running python preprocess.py. Inference is initiated with python complete_test_generate.py.
  • Links: Paper, Project Page, Demo Video
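
Put together, a typical session might look like the following. This is a sketch: the commands come from the bullets above, but the arguments each script takes (speaker names, dataset paths, checkpoints) are placeholders, so consult the repository README for the exact invocations.

```shell
# Clone the repository and install dependencies (Python 3.7.4 recommended)
git clone https://github.com/Rudrabha/Lip2Wav.git
cd Lip2Wav
pip install -r requirements.txt

# Download the s3fd.pth face detection model into the expected folder
# (see the README for the download link and destination path)

# Fetch video data for a speaker
sh download_speaker.sh  # pass the speaker/dataset directory per the README

# Preprocess the downloaded videos, then run inference
python preprocess.py               # speaker root and name per the README
python complete_test_generate.py   # dataset, results dir, and checkpoint per the README
```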

Highlighted Details

  • First work to generate intelligible speech from lip movements in unconstrained settings.
  • Sequence-to-sequence modeling of the lip-to-speech problem.
  • Released dataset for 5 speakers with 100+ hours of video data.
  • Released complete training code and pre-trained models.
  • Code for calculating PESQ, ESTOI, and STOI metrics is available.

Maintenance & Community

The project is associated with CVPR 2020. Further information on community or ongoing maintenance is not detailed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive for commercial use and closed-source linking.

Limitations & Caveats

The code is tested with Python 3.7.4, and compatibility with newer Python versions is not guaranteed. The README points readers interested in lip-syncing talking face videos to a separate repository, Wav2Lip, suggesting that active development has shifted there.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days

Explore Similar Projects

Starred by Tim J. Baek (Founder of Open WebUI), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

StyleTTS2 by yl4579

Top 0.2% on sourcepulse · 6k stars
Text-to-speech model achieving human-level synthesis
created 2 years ago, updated 11 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

AudioGPT by AIGC-Audio

Top 0.1% on sourcepulse · 10k stars
Audio processing and generation research project
created 2 years ago, updated 1 year ago