Visual_Speech_Recognition_for_Multiple_Languages by mpc001

VSR research code for multilingual audio-visual speech recognition

created 3 years ago
425 stars

Top 70.5% on sourcepulse

Project Summary

This repository provides a framework for multi-language Visual Speech Recognition (VSR) and Audio-Visual Speech Recognition (AV-ASR). It targets researchers and developers working on robust speech recognition systems that can leverage visual cues, particularly in challenging conditions or for languages where audio-only recognition is less effective. The project offers pre-trained models and training recipes, aiming to achieve state-of-the-art performance across various datasets.

How It Works

The system uses end-to-end deep learning models built on Conformer-based encoders. It processes video frames of the speaker's mouth region to extract visual speech features and can optionally fuse them with audio features for improved accuracy. The framework supports multiple languages and datasets, including LRS2, LRS3, CMLR, CMU-MOSEAS, GRID, Lombard GRID, and TCD-TIMIT, with separate pre-trained models for visual-only, audio-only, and audio-visual configurations.
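
As a rough illustration of that pipeline, the sketch below is not taken from the repository: the class names, layer sizes, and the plain Transformer encoder standing in for the Conformer blocks are all made up for illustration. It only shows the conceptual flow from cropped mouth frames through a 3D-convolutional front-end, a temporal encoder, and a per-frame classification head.

```python
import torch
import torch.nn as nn

class ToyVisualFrontEnd(nn.Module):
    """3D-conv front-end: grayscale mouth-ROI frames -> per-frame features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, height, width)
        x = self.conv3d(x)                        # (B, C, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (B, C, T)
        return self.proj(x.transpose(1, 2))       # (B, T, feat_dim)

class ToyVSRModel(nn.Module):
    """Front-end + temporal encoder + per-frame token logits."""
    def __init__(self, vocab_size: int = 40, feat_dim: int = 256):
        super().__init__()
        self.frontend = ToyVisualFrontEnd(feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        # Stand-in for the Conformer blocks used by the released models.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        feats = self.frontend(video)   # (B, T, feat_dim)
        feats = self.encoder(feats)    # (B, T, feat_dim)
        return self.head(feats)        # (B, T, vocab); fed to a decoder in practice

# 8 frames of 88x88 mouth crops (a commonly used ROI size; the repo's exact crop may differ)
logits = ToyVSRModel()(torch.randn(1, 1, 8, 88, 88))
print(logits.shape)  # torch.Size([1, 8, 40])
```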

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -y -n autoavsr python=3.8), activate it (conda activate autoavsr), install PyTorch, torchvision, torchaudio, and other dependencies (pip install -r requirements.txt). Download pre-trained models and language models from the model zoo.
  • Prerequisites: Python 3.8, PyTorch, torchvision, torchaudio, and optionally a face/landmark detector (RetinaFace or MediaPipe); a quick import check follows this list.
  • Setup: Requires downloading pre-trained models and language models, plus pre-computed facial landmarks for data preparation, which can be substantial in size (e.g., ~18GB for the LRS3 landmarks).
  • Links: Introduction, Preparation, Benchmark evaluation, Speech prediction, Model zoo.
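
As a quick post-install check, the snippet below (not shipped with the repository) verifies that the core dependencies listed above import cleanly before the larger model downloads:

```python
# Illustrative post-install sanity check; not part of the repository.
import importlib

for pkg in ("torch", "torchvision", "torchaudio", "mediapipe"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:12s} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg:12s} missing -- install it before running the pipelines")
# RetinaFace is the other supported detector; its import name depends on the
# chosen implementation, so it is not probed here.
```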

Highlighted Details

  • Achieves 19.1% WER for visual-only (VSR), 1.0% for audio-only (ASR), and 0.9% for audio-visual speech recognition on LRS3.
  • Supports a wide range of datasets and languages, including English, Mandarin, Spanish, Portuguese, and French.
  • Provides pre-trained models for visual-only, audio-only, and audio-visual configurations.
  • Includes tools for mouth ROI cropping and visual speech feature extraction; an illustrative cropping sketch follows this list.
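
To give a sense of what the mouth-ROI step involves, here is a minimal sketch using MediaPipe Face Mesh (one of the two supported detectors). It is not the repository's own cropping tool; the landmark subset, margin, and output size are assumptions made for illustration.

```python
# Illustrative mouth-ROI crop with MediaPipe Face Mesh; treat as a sketch only.
import cv2
import mediapipe as mp
import numpy as np

MOUTH_IDS = [61, 291, 13, 14]  # approximate mouth corners and inner lips in the 468-point mesh

def crop_mouth(frame_bgr: np.ndarray, size: int = 96, margin: float = 0.6) -> np.ndarray:
    h, w = frame_bgr.shape[:2]
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as fm:
        result = fm.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_face_landmarks:
        raise ValueError("no face detected")
    lm = result.multi_face_landmarks[0].landmark
    xs = np.array([lm[i].x * w for i in MOUTH_IDS])
    ys = np.array([lm[i].y * h for i in MOUTH_IDS])
    cx, cy = xs.mean(), ys.mean()
    half = (1 + margin) * max(xs.ptp(), ys.ptp()) / 2  # square box around the mouth
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    roi = cv2.cvtColor(frame_bgr[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    return cv2.resize(roi, (size, size))

# usage: roi = crop_mouth(cv2.imread("frame.png"))
```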

Maintenance & Community

The project is authored by Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. Recent updates in 2023 include the release of training recipes for real-time AV-ASR and AutoAVSR models. Contact information for Pingchuan Ma is provided.

Licensing & Compatibility

The code is released under a non-commercial license and may be used for comparative or benchmarking purposes only.

Limitations & Caveats

The license explicitly restricts usage to non-commercial and benchmarking activities, limiting its applicability for commercial product development. The setup requires downloading significant amounts of data, including large landmark files.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 90 days
