Visual_Speech_Recognition_for_Multiple_Languages by mpc001

VSR research code for multilingual audio-visual speech recognition

created 3 years ago
425 stars

Top 70.5% on sourcepulse

Project Summary

This repository provides a framework for multi-language Visual Speech Recognition (VSR) and Audio-Visual Speech Recognition (AV-ASR). It targets researchers and developers working on robust speech recognition systems that can leverage visual cues, particularly in challenging conditions or for languages where audio-only recognition is less effective. The project offers pre-trained models and training recipes, aiming to achieve state-of-the-art performance across various datasets.

How It Works

The system uses end-to-end deep learning models built on Conformer-based encoders. It processes video frames of the speaker's mouth region to extract visual speech features and can optionally fuse them with audio features for improved accuracy. The framework supports multiple languages and datasets, including LRS2, LRS3, CMLR, CMU-MOSEAS, GRID, Lombard GRID, and TCD-TIMIT, with separate pre-trained models for visual-only, audio-only, and audio-visual configurations.
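
As a rough illustration of that pipeline, the sketch below is not taken from the repository: the class names, layer sizes, and the plain Transformer encoder standing in for the Conformer blocks are all made up for illustration. It only shows the conceptual flow from cropped mouth frames through a 3D-convolutional front-end, a temporal encoder, and a per-frame classification head.

```python
import torch
import torch.nn as nn

class ToyVisualFrontEnd(nn.Module):
    """3D-conv front-end: grayscale mouth-ROI frames -> per-frame features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, height, width)
        x = self.conv3d(x)                        # (B, C, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (B, C, T)
        return self.proj(x.transpose(1, 2))       # (B, T, feat_dim)

class ToyVSRModel(nn.Module):
    """Front-end + temporal encoder + per-frame token logits."""
    def __init__(self, vocab_size: int = 40, feat_dim: int = 256):
        super().__init__()
        self.frontend = ToyVisualFrontEnd(feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        # Stand-in for the Conformer blocks used by the released models.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(video)   # (B, T, feat_dim)
        feats = self.encoder(feats)    # (B, T, feat_dim)
        return self.head(feats)        # (B, T, vocab); fed to a decoder in practice

# 8 frames of 88x88 mouth crops (a commonly used ROI size; the repo's exact crop may differ)
logits = ToyVSRModel()(torch.randn(1, 1, 8, 88, 88))
print(logits.shape)  # torch.Size([1, 8, 40])
```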

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -y -n autoavsr python=3.8), activate it (conda activate autoavsr), install PyTorch, torchvision, torchaudio, and other dependencies (pip install -r requirements.txt). Download pre-trained models and language models from the model zoo.
  • Prerequisites: Python 3.8, PyTorch, torchvision, torchaudio, and optionally a face/landmark detector (RetinaFace or MediaPipe); a quick import check follows this list.
  • Setup: Requires downloading pre-trained models and language models, plus pre-computed facial landmarks for data preparation, which can be substantial in size (e.g., ~18GB for the LRS3 landmarks).
  • Links: Introduction, Preparation, Benchmark evaluation, Speech prediction, Model zoo.
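
As a quick post-install check, the snippet below (not shipped with the repository) verifies that the core dependencies listed above import cleanly before the larger model downloads:

```python
# Illustrative post-install sanity check; not part of the repository.
import importlib

for pkg in ("torch", "torchvision", "torchaudio", "mediapipe"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:12s} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg:12s} missing -- install it before running the pipelines")
# RetinaFace is the other supported detector; its import name depends on the
# chosen implementation, so it is not probed here.
```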

Highlighted Details

  • Achieves 19.1% WER for visual-only (VSR), 1.0% for audio-only (ASR), and 0.9% for audio-visual speech recognition on LRS3.
  • Supports a wide range of datasets and languages, including English, Mandarin, Spanish, Portuguese, and French.
  • Provides pre-trained models for visual-only, audio-only, and audio-visual configurations.
  • Includes tools for mouth ROI cropping and visual speech feature extraction; an illustrative cropping sketch follows this list.
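
To give a sense of what the mouth-ROI step involves, here is a minimal sketch using MediaPipe Face Mesh (one of the two supported detectors). It is not the repository's own cropping tool; the landmark subset, margin, and output size are assumptions made for illustration.

```python
# Illustrative mouth-ROI crop with MediaPipe Face Mesh; treat as a sketch only.
import cv2
import mediapipe as mp
import numpy as np

MOUTH_IDS = [61, 291, 13, 14]  # approximate mouth corners and inner lips in the 468-point mesh

def crop_mouth(frame_bgr: np.ndarray, size: int = 96, margin: float = 0.6) -> np.ndarray:
    h, w = frame_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        result = fm.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        raise ValueError("no face detected")
    lm = result.multi_face_landmarks[0].landmark
    xs = np.array([lm[i].x * w for i in MOUTH_IDS])
    ys = np.array([lm[i].y * h for i in MOUTH_IDS])
    cx, cy = xs.mean(), ys.mean()
    half = (1 + margin) * max(xs.ptp(), ys.ptp()) / 2  # square box around the mouth
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    roi = cv2.cvtColor(frame_bgr[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    return cv2.resize(roi, (size, size))

# usage: roi = crop_mouth(cv2.imread("frame.png"))
```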

Maintenance & Community

The project is authored by Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. Recent updates in 2023 include the release of training recipes for real-time AV-ASR and AutoAVSR models. Contact information for Pingchuan Ma is provided.

Licensing & Compatibility

The code is released under a non-commercial license and may be used for comparative or benchmarking purposes only.

Limitations & Caveats

The license explicitly restricts usage to non-commercial and benchmarking activities, limiting its applicability for commercial product development. The setup requires downloading significant amounts of data, including large landmark files.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days
