av_hubert by facebookresearch

Self-supervised learning framework for audio-visual speech research

created 3 years ago
925 stars

Top 40.3% on sourcepulse

Project Summary

AV-HuBERT is a self-supervised learning framework for audio-visual speech representation, aimed at researchers and engineers working on speech recognition and lip reading. It reports state-of-the-art results on the LRS3 benchmark by learning jointly from audio and visual speech cues, which makes speech recognition more robust in noisy conditions.

How It Works

AV-HuBERT uses masked multimodal cluster prediction, an approach inspired by BERT. It masks portions of the input audio and visual streams and trains the model to predict discrete cluster assignments for the masked segments. This masked-prediction objective encourages the model to capture synchronized audio-visual speech representations.
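
The masked-prediction idea can be illustrated with a minimal, standard-library-only sketch. This is not the repository's implementation: the function names `span_mask` and `masked_cluster_loss`, the span-sampling rule, and the plain cross-entropy over cluster IDs are all simplifying assumptions chosen to show that the loss is computed only on masked frames.

```python
import math
import random


def span_mask(num_frames, span_len, mask_prob, rng):
    """Return a list of booleans marking masked frames.

    Assumed scheme (for illustration only): every frame starts a masked
    span of `span_len` frames with probability `mask_prob`. Spans may
    overlap and are clipped at the end of the sequence.
    """
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_cluster_loss(logits, targets, mask):
    """Mean cross-entropy over masked frames only.

    logits:  per-frame scores over K cluster IDs, shape [T][K]
    targets: per-frame cluster ID (int in [0, K)), length T
    mask:    per-frame bool, length T; unmasked frames are ignored
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(frame_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0
```

In a real model the logits would come from a network that sees the masked audio and video features, and the targets from a clustering step; here both are plain lists so that the loss computation is easy to inspect in isolation.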

Quick Start & Requirements

  • Installation: Create and activate a Python 3.8 conda environment, clone the repository, initialize and update its submodules, then install dependencies with pip install -r requirements.txt.
  • Prerequisites: Python 3.8, Fairseq, and large datasets (such as LRS3 and VoxCeleb2) for training.
  • Demo: A Colab notebook is available for a lip-reading demo.
  • Documentation: Links to pre-trained checkpoints and data preparation steps are provided.

Highlighted Details

  • Reported state-of-the-art performance on the LRS3 benchmark for lip reading, ASR, and audio-visual speech recognition.
  • Robustness to noisy environments can be tested by adding noise to audio inputs.
  • Supports decoding for lip reading (video only), ASR (audio only), and audio-visual speech recognition (audio+video).
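
As a rough illustration of the noise-robustness test mentioned above, the sketch below mixes a noise signal into a speech signal at a chosen signal-to-noise ratio. It is not the repository's noise-augmentation code: the helper name `mix_at_snr` is hypothetical, and signals are assumed to be equal-length lists of float samples.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has SNR `snr_db` (dB).

    Hypothetical helper for illustration. `speech` and `noise` are
    equal-length, non-empty sequences of float samples.
    """
    if not speech or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equal length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # Solve snr_db = 10 * log10(speech_power / (scale**2 * noise_power)).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

Lower `snr_db` values produce a noisier mixture, so decoding the same utterances at several SNR levels gives a simple picture of how accuracy degrades as the audio gets worse.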

Maintenance & Community

Developed by Meta AI. Citation details for relevant research papers are provided.

Licensing & Compatibility

The software is provided under a custom license agreement from Meta Platforms, Inc. Users must agree to the terms to use the software.

Limitations & Caveats

The framework requires significant computational resources and large datasets for training. Specific data preparation steps are necessary for both pre-training and fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 90 days
