Self-supervised learning framework for audio-visual speech research
AV-HuBERT is a self-supervised learning framework for audio-visual speech representation, targeting researchers and engineers working on speech recognition and lip reading. By jointly learning from audio and visual speech cues (lip movements), it achieves state-of-the-art results on lip-reading and audio-visual speech recognition benchmarks such as LRS3, and remains robust when the audio is noisy.
How It Works
AV-HuBERT employs a masked multimodal cluster prediction approach, inspired by BERT's masked language modeling. It masks spans of the paired audio and visual input streams and trains a model to predict discrete cluster assignments for the masked frames; the clusters are initially derived from acoustic features and then iteratively refined using the model's own learned representations. This masked prediction objective encourages the model to capture synchronized audio-visual speech representations.
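A minimal sketch of the training objective in PyTorch, assuming simplified feature dimensions and additive fusion of the two streams (the actual model is much deeper and fuses modalities differently, e.g. with modality dropout):

    import torch
    import torch.nn as nn

    # Toy illustration of masked audio-visual cluster prediction.
    # Dimensions and the additive fusion are assumptions, not the repo's values.
    class MaskedClusterPredictor(nn.Module):
        def __init__(self, audio_dim=104, video_dim=512, hidden=768, n_clusters=500):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, hidden)
            self.video_proj = nn.Linear(video_dim, hidden)
            self.mask_emb = nn.Parameter(torch.zeros(hidden))  # learned mask embedding
            layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(hidden, n_clusters)  # per-frame cluster logits

        def forward(self, audio, video, mask):
            x = self.audio_proj(audio) + self.video_proj(video)  # fuse modalities
            x[mask] = self.mask_emb                              # hide masked frames
            return self.head(self.encoder(x))

    model = MaskedClusterPredictor()
    audio = torch.randn(2, 100, 104)            # (batch, frames, audio features)
    video = torch.randn(2, 100, 512)            # (batch, frames, lip-ROI features)
    mask = torch.rand(2, 100) < 0.3             # mask ~30% of frames
    targets = torch.randint(0, 500, (2, 100))   # frame-level cluster IDs
    logits = model(audio, video, mask)
    # The loss is cross-entropy on the masked frames only.
    loss = nn.functional.cross_entropy(logits[mask], targets[mask])
    loss.backward()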
Quick Start & Requirements
Create and activate a Python 3.8 conda environment, then install the dependencies:

    conda create -n avhubert python=3.8 -y
    conda activate avhubert
    pip install -r requirements.txt
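Once installed, pretrained checkpoints can typically be loaded through fairseq's checkpoint utilities. A minimal sketch, assuming a checkpoint has already been downloaded and that the repository's model and task definitions have been imported so fairseq can find them (the path below is a placeholder):

    from fairseq import checkpoint_utils

    ckpt_path = "/path/to/avhubert_checkpoint.pt"  # placeholder path

    # Returns (list of models, saved config, task); take the first model.
    models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
    model = models[0].eval()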
Maintenance & Community
Developed by Meta AI. Citation details for the underlying research papers are provided in the repository.
Licensing & Compatibility
The software is provided under a custom license agreement from Meta Platforms, Inc. Users must agree to the terms to use the software.
Limitations & Caveats
The framework requires significant computational resources and large datasets for training. Specific data preparation steps are necessary for both pre-training and fine-tuning.
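For orientation only: training data for fairseq-based models like this is usually indexed by TSV manifests that pair each clip's video and audio files. The exact columns AV-HuBERT expects are defined by its preparation scripts, so the layout below is a hypothetical sketch:

    # Hypothetical fairseq-style manifest writer; the column order is an assumption.
    def write_manifest(root, clips, out_path):
        with open(out_path, "w") as f:
            f.write(root + "\n")  # first line: dataset root directory
            for clip_id, video, audio, n_vid_frames, n_aud_samples in clips:
                f.write(f"{clip_id}\t{video}\t{audio}\t{n_vid_frames}\t{n_aud_samples}\n")

    clips = [("spk1_utt1", "video/spk1_utt1.mp4", "audio/spk1_utt1.wav", 75, 48000)]
    write_manifest("/data/lrs3", clips, "train.tsv")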