Visual_Speech_Recognition_for_Multiple_Languages by mpc001

VSR research code for multilingual audio-visual speech recognition

Created 3 years ago
436 stars

Top 68.4% on SourcePulse

Project Summary

This repository provides a framework for multi-language Visual Speech Recognition (VSR) and Audio-Visual Speech Recognition (AV-ASR). It targets researchers and developers working on robust speech recognition systems that can leverage visual cues, particularly in challenging conditions or for languages where audio-only recognition is less effective. The project offers pre-trained models and training recipes, aiming to achieve state-of-the-art performance across various datasets.

How It Works

The system is an end-to-end deep learning pipeline, most likely built on Conformer-based encoders as in its predecessor project. It processes video frames to extract visual speech features and can optionally fuse them with audio features for improved accuracy. The framework supports multiple languages and datasets, including LRS2, LRS3, CMLR, CMU-MOSEAS, GRID, Lombard GRID, and TCD-TIMIT, and provides separate models for visual-only, audio-only, and audio-visual configurations. A minimal sketch of the Conformer-based audio-visual encoding idea follows below.
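The sketch below illustrates the general idea of encoding visual and audio streams with Conformer blocks and fusing them, using torchaudio's generic Conformer module. The class name, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the repository's exact architecture.

```python
# Illustrative sketch of Conformer-based audio-visual encoding (assumptions:
# dimensions, layer counts, and the simple concatenation fusion are ours).
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class ToyAVEncoder(nn.Module):
    def __init__(self, video_dim=512, audio_dim=80, model_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, model_dim)
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        self.video_enc = Conformer(input_dim=model_dim, num_heads=4,
                                   ffn_dim=1024, num_layers=4,
                                   depthwise_conv_kernel_size=31)
        self.audio_enc = Conformer(input_dim=model_dim, num_heads=4,
                                   ffn_dim=1024, num_layers=4,
                                   depthwise_conv_kernel_size=31)
        self.fusion = nn.Linear(2 * model_dim, model_dim)

    def forward(self, video_feats, audio_feats, lengths):
        # video_feats: (batch, time, video_dim) per-frame visual features
        # audio_feats: (batch, time, audio_dim) e.g. log-mel filterbanks,
        #              assumed already aligned to the video frame rate
        v, _ = self.video_enc(self.video_proj(video_feats), lengths)
        a, _ = self.audio_enc(self.audio_proj(audio_feats), lengths)
        return self.fusion(torch.cat([v, a], dim=-1))  # fused representation


# Example with random tensors standing in for real features.
enc = ToyAVEncoder()
lengths = torch.tensor([75, 75])       # 75 frames ≈ 3 s of video at 25 fps
video = torch.randn(2, 75, 512)
audio = torch.randn(2, 75, 80)
fused = enc(video, audio, lengths)     # shape (2, 75, 256)
```

In the real system, such a fused representation would feed a decoder (e.g., hybrid CTC/attention) to produce text; here the fusion output is the end of the sketch.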

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -y -n autoavsr python=3.8, then conda activate autoavsr), install PyTorch, torchvision, and torchaudio, and install the remaining dependencies with pip install -r requirements.txt. Finally, download the pre-trained models and language models from the model zoo. A short environment check follows this list.
  • Prerequisites: Python 3.8, PyTorch, and optionally a face/landmark detector (RetinaFace or MediaPipe).
  • Setup: Downloading the pre-trained models, language models, and pre-computed landmarks can require substantial disk space (e.g., ~18 GB for the LRS3 landmarks).
  • Links: Introduction, Preparation, Benchmark evaluation, Speech prediction, Model zoo.
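The snippet below is a small sanity check (a sketch, not part of the repository) to confirm that the environment created above exposes the expected packages and, if present, a GPU.

```python
# Environment sanity check for the conda environment described above.
import sys

import torch
import torchaudio
import torchvision

print("python     :", sys.version.split()[0])   # expect 3.8.x
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio :", torchaudio.__version__)
print("CUDA avail :", torch.cuda.is_available())
```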

Highlighted Details

  • Achieves 19.1% WER for visual-only, 1.0% for audio-only, and 0.9% for audio-visual speech recognition on LRS3.
  • Supports a wide range of datasets and languages, including English, Mandarin, Spanish, Portuguese, and French.
  • Provides pre-trained models for visual-only, audio-only, and audio-visual configurations.
  • Includes tools for mouth ROI cropping and visual speech feature extraction.
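As an illustration of the mouth ROI cropping step, the sketch below uses MediaPipe FaceMesh (one of the two detectors mentioned in the prerequisites) to crop a mouth region from a single frame. The landmark index set, padding, and function name are assumptions for illustration; the repository's own cropping tools may differ.

```python
# Illustrative mouth-ROI cropping with MediaPipe FaceMesh (assumed approach;
# not the repository's exact preprocessing code).
import cv2
import mediapipe as mp
import numpy as np

# A few commonly used FaceMesh lip landmarks: mouth corners and lip midpoints.
LIP_LANDMARKS = [61, 291, 0, 17, 13, 14]


def crop_mouth_roi(frame_bgr, pad=0.5):
    """Return a square mouth crop from a single BGR frame, or None if no face."""
    h, w = frame_bgr.shape[:2]
    # A per-call FaceMesh instance keeps the sketch simple; reuse it for video.
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as mesh:
        result = mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark
    xs = np.array([lm[i].x * w for i in LIP_LANDMARKS])
    ys = np.array([lm[i].y * h for i in LIP_LANDMARKS])
    cx, cy = xs.mean(), ys.mean()
    half = (1 + pad) * max(xs.max() - xs.min(), ys.max() - ys.min()) / 2
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    return frame_bgr[y0:y1, x0:x1]


# Usage: crop = crop_mouth_roi(cv2.imread("frame.png"))
```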

Maintenance & Community

The project is authored by Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. Recent updates in 2023 include the release of training recipes for real-time AV-ASR and AutoAVSR models. Contact information for Pingchuan Ma is provided.

Licensing & Compatibility

The code may be used for non-commercial purposes only, and the license restricts it to comparative and benchmarking use.

Limitations & Caveats

The license explicitly restricts usage to non-commercial benchmarking and comparison, which precludes commercial product development. The setup also requires downloading significant amounts of data, including large landmark files.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Shane Thomas (Cofounder of Mastra), Alex Yu (Research Scientist at OpenAI; former Cofounder of Luma AI), and 2 more.

Wav2Lip by Rudrabha
Lip-syncing tool for generating videos from speech
12k stars · Top 0.2% · Created 5 years ago · Updated 2 months ago