Self-supervised learning framework for audio-visual speech research
AV-HuBERT is a self-supervised learning framework for audio-visual speech representation, targeting researchers and engineers working on speech recognition and lip reading. By jointly learning from audio and visual speech cues (lip movements), it achieves state-of-the-art results on lip-reading and audio-visual speech recognition benchmarks such as LRS3, and remains robust when the audio is noisy.
How It Works
AV-HuBERT employs a masked multimodal cluster prediction approach, inspired by BERT's masked language modeling. It masks spans of the paired audio and visual input streams and trains a model to predict discrete cluster assignments for the masked frames; the clusters are initially derived from acoustic features and then iteratively refined using the model's own learned representations. This masked prediction objective encourages the model to capture synchronized audio-visual speech representations.
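A minimal sketch of the training objective in PyTorch, assuming simplified feature dimensions and additive fusion of the two streams (the actual model is much deeper and fuses modalities differently, e.g. with modality dropout):

    import torch
    import torch.nn as nn

    # Toy illustration of masked audio-visual cluster prediction.
    # Dimensions and the additive fusion are assumptions, not the repo's values.
    class MaskedClusterPredictor(nn.Module):
        def __init__(self, audio_dim=104, video_dim=512, hidden=768, n_clusters=500):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, hidden)
            self.video_proj = nn.Linear(video_dim, hidden)
            self.mask_emb = nn.Parameter(torch.zeros(hidden))  # learned mask embedding
            layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(hidden, n_clusters)  # per-frame cluster logits

        def forward(self, audio, video, mask):
            x = self.audio_proj(audio) + self.video_proj(video)  # fuse modalities
            x[mask] = self.mask_emb                              # hide masked frames
            return self.head(self.encoder(x))

    model = MaskedClusterPredictor()
    audio = torch.randn(2, 100, 104)            # (batch, frames, audio features)
    video = torch.randn(2, 100, 512)            # (batch, frames, lip-ROI features)
    mask = torch.rand(2, 100) < 0.3             # mask ~30% of frames
    targets = torch.randint(0, 500, (2, 100))   # frame-level cluster IDs
    logits = model(audio, video, mask)
    # The loss is cross-entropy on the masked frames only.
    loss = nn.functional.cross_entropy(logits[mask], targets[mask])
    loss.backward()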
Quick Start & Requirements
Create and activate a Python 3.8 conda environment, then install the dependencies:

    conda create -n avhubert python=3.8 -y
    conda activate avhubert
    pip install -r requirements.txt
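Once installed, pretrained checkpoints can typically be loaded through fairseq's checkpoint utilities. A minimal sketch, assuming a checkpoint has already been downloaded and that the repository's model and task definitions have been imported so fairseq can find them (the path below is a placeholder):

    from fairseq import checkpoint_utils

    ckpt_path = "/path/to/avhubert_checkpoint.pt"  # placeholder path

    # Returns (list of models, saved config, task); take the first model.
    models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
    model = models[0].eval()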
Maintenance & Community
Developed by Meta AI. Citation details for the underlying research papers are provided in the repository.
Licensing & Compatibility
The software is provided under a custom license agreement from Meta Platforms, Inc. Users must agree to the terms to use the software.
Limitations & Caveats
The framework requires significant computational resources and large datasets for training. Specific data preparation steps are necessary for both pre-training and fine-tuning.
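For orientation only: training data for fairseq-based models like this is usually indexed by TSV manifests that pair each clip's video and audio files. The exact columns AV-HuBERT expects are defined by its preparation scripts, so the layout below is a hypothetical sketch:

    # Hypothetical fairseq-style manifest writer; the column order is an assumption.
    def write_manifest(root, clips, out_path):
        with open(out_path, "w") as f:
            f.write(root + "\n")  # first line: dataset root directory
            for clip_id, video, audio, n_vid_frames, n_aud_samples in clips:
                f.write(f"{clip_id}\t{video}\t{audio}\t{n_vid_frames}\t{n_aud_samples}\n")

    clips = [("spk1_utt1", "video/spk1_utt1.mp4", "audio/spk1_utt1.wav", 75, 48000)]
    write_manifest("/data/lrs3", clips, "train.tsv")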