av_hubert by facebookresearch

Self-supervised learning framework for audio-visual speech research

Created 3 years ago
937 stars

Top 39.1% on SourcePulse

View on GitHub
Project Summary

AV-HuBERT is a self-supervised learning framework for audio-visual speech representation, targeting researchers and engineers in speech recognition and lip reading. It achieves state-of-the-art results by jointly learning from audio and visual speech cues, enabling robust speech recognition in challenging conditions.

How It Works

AV-HuBERT employs a masked multimodal cluster-prediction approach inspired by BERT: it masks portions of the input audio and visual streams and trains the model to predict discrete cluster assignments for the masked frames. This masked-prediction objective encourages the model to learn synchronized audio-visual speech representations.
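A minimal sketch of the idea follows. This is not the official implementation: the feature dimensions, additive fusion, tiny transformer, and random cluster targets below are illustrative assumptions (the paper derives targets from iteratively refined feature clustering and also uses modality dropout).

```python
# Illustrative sketch of masked multimodal cluster prediction (PyTorch).
# All shapes and modules here are assumptions, not AV-HuBERT's actual code.
import torch
import torch.nn as nn

B, T, D_A, D_V, D, K = 2, 50, 104, 512, 256, 100  # batch, frames, dims, clusters

audio = torch.randn(B, T, D_A)          # placeholder audio features
video = torch.randn(B, T, D_V)          # placeholder lip-ROI visual features
targets = torch.randint(0, K, (B, T))   # stand-in frame-level cluster IDs

mask = torch.rand(B, T) < 0.3           # mask ~30% of frames

proj_a = nn.Linear(D_A, D)
proj_v = nn.Linear(D_V, D)
fused = proj_a(audio) + proj_v(video)   # simple additive audio-visual fusion

# Replace masked frames with a learned mask embedding.
mask_emb = nn.Parameter(torch.randn(D) * 0.02)
fused = torch.where(mask.unsqueeze(-1), mask_emb.expand_as(fused), fused)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2,
)
logits = nn.Linear(D, K)(encoder(fused))  # predict cluster IDs per frame

# The loss is computed on masked frames only, as in masked-prediction training.
loss = nn.functional.cross_entropy(logits[mask], targets[mask])
print(f"masked-prediction loss: {loss.item():.3f}")
```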

Quick Start & Requirements

  • Installation: Clone the repository, initialize and update its submodules, and install dependencies with pip install -r requirements.txt inside a Python 3.8 conda environment.
  • Prerequisites: Python 3.8, Fairseq, and, for training, large datasets such as LRS3 and VoxCeleb2.
  • Demo: A Colab notebook provides a lip-reading demo.
  • Documentation: Links to pre-trained checkpoints and data-preparation steps are provided (a hedged loading sketch follows this list).
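A pre-trained checkpoint can typically be loaded through fairseq's generic checkpoint utilities. The sketch below is hedged: both paths are placeholders, and it assumes the repo's avhubert user directory is importable so fairseq can resolve the registered model and task classes (exact entry points can vary across fairseq versions).

```python
# Hedged sketch: loading an AV-HuBERT checkpoint via fairseq utilities.
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Placeholder path to the repo's `avhubert` user directory.
utils.import_user_module(Namespace(user_dir="/path/to/av_hubert/avhubert"))

# Placeholder checkpoint path.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/avhubert_checkpoint.pt"]
)
model = models[0].eval()
```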

Highlighted Details

  • State-of-the-art performance on the LRS3 benchmark for lip reading, ASR, and audio-visual speech recognition.
  • Robustness to noisy environments can be tested by adding noise to the audio inputs (see the sketch after this list).
  • Supports decoding for lip reading (video only), ASR (audio only), and audio-visual speech recognition (audio + video).
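As a simple illustration of the noise-robustness test mentioned above, noise can be mixed into the waveform at a chosen SNR before decoding. This is a generic, numpy-only sketch with placeholder signals, not the repo's own noise pipeline.

```python
# Hedged sketch: mix noise into speech at a target SNR (in dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of placeholder audio at 16 kHz
noise = rng.standard_normal(16000)   # placeholder noise of equal length
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```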

Maintenance & Community

Developed by Meta AI. Citation details for relevant research papers are provided.

Licensing & Compatibility

The software is provided under a custom license agreement from Meta Platforms, Inc. Users must agree to the terms to use the software.

Limitations & Caveats

The framework requires significant computational resources and large datasets for training. Specific data preparation steps are necessary for both pre-training and fine-tuning.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Shane Thomas (Cofounder of Mastra), Alex Yu (Research Scientist at OpenAI; former Cofounder of Luma AI), and 2 more.

Wav2Lip by Rudrabha

Top 0.2% on SourcePulse · 12k stars
Lip-syncing tool for generating videos from speech
Created 5 years ago · Updated 2 months ago