VSR research code for multilingual audio-visual speech recognition
This repository provides a framework for multi-language Visual Speech Recognition (VSR) and Audio-Visual Speech Recognition (AV-ASR). It targets researchers and developers working on robust speech recognition systems that can leverage visual cues, particularly in challenging conditions or for languages where audio-only recognition is less effective. The project offers pre-trained models and training recipes, aiming to achieve state-of-the-art performance across various datasets.
How It Works
The system employs a deep learning approach, likely utilizing Conformer-based architectures as indicated by its predecessor. It processes video frames to extract visual speech features and can optionally integrate audio information for improved accuracy. The framework supports multiple languages and datasets, including LRS2, LRS3, CMLR, CMU-MOSEAS, GRID, Lombard GRID, and TCD-TIMIT, offering separate models for visual-only, audio-only, and audio-visual configurations.
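The three configurations described above (visual-only, audio-only, audio-visual) can be illustrated with a minimal sketch. Everything here is hypothetical: the `Modality` enum, the `select_features` helper, and the use of per-frame concatenation as the fusion step are illustrative assumptions, not the repository's actual API or fusion method.

```python
from enum import Enum
from typing import List, Optional

# One feature vector per time step (hypothetical representation).
Features = List[List[float]]


class Modality(Enum):
    VISUAL = "visual"
    AUDIO = "audio"
    AUDIOVISUAL = "audiovisual"


def select_features(
    modality: Modality,
    visual: Optional[Features] = None,
    audio: Optional[Features] = None,
) -> Features:
    """Return the feature sequence a recognizer would consume.

    Assumption: audio-visual fusion is shown as plain per-step
    concatenation, which is only one of several possible strategies.
    """
    if modality is Modality.VISUAL:
        if visual is None:
            raise ValueError("visual features are required")
        return [list(v) for v in visual]
    if modality is Modality.AUDIO:
        if audio is None:
            raise ValueError("audio features are required")
        return [list(a) for a in audio]
    # Audio-visual: both streams must be present and time-aligned.
    if visual is None or audio is None:
        raise ValueError("both visual and audio features are required")
    if len(visual) != len(audio):
        raise ValueError("streams must have the same number of time steps")
    return [list(v) + list(a) for v, a in zip(visual, audio)]
```

In this sketch the audio stream is optional for the visual-only case, mirroring the description that audio can be integrated when available.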
Quick Start & Requirements
Create a conda environment (conda create -y -n autoavsr python=3.8), activate it (conda activate autoavsr), install PyTorch, torchvision, torchaudio, and other dependencies (pip install -r requirements.txt). Download pre-trained models and language models from the model zoo.
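After the installation steps, a quick sanity check can confirm that the environment looks as expected. The following is a minimal sketch, not part of the repository: the helper names `missing_packages` and `check_environment` are hypothetical, and it only verifies the Python version and that the three PyTorch packages can be located, not that their versions are mutually compatible.

```python
import sys
from importlib.util import find_spec

# Import names of the packages mentioned in the setup steps.
REQUIRED = ("torch", "torchvision", "torchaudio")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def check_environment(expected_python=(3, 8)):
    """Report whether the interpreter version matches and what is missing."""
    return {
        "python_ok": sys.version_info[:2] == tuple(expected_python),
        "missing": missing_packages(),
    }


if __name__ == "__main__":
    report = check_environment()
    print("Python 3.8:", "yes" if report["python_ok"] else "no")
    print("Missing packages:", ", ".join(report["missing"]) or "none")
```

Run it inside the activated autoavsr environment; any package listed as missing suggests the corresponding install step needs to be repeated.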
Maintenance & Community
The project is authored by Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. Recent updates in 2023 include the release of training recipes for real-time AV-ASR and AutoAVSR models. Contact information for Pingchuan Ma is provided.
Licensing & Compatibility
The code may be used only for comparative or benchmarking purposes, and only non-commercially.
Limitations & Caveats
The license explicitly restricts usage to non-commercial and benchmarking activities, limiting its applicability for commercial product development. The setup requires downloading significant amounts of data, including large landmark files.