RNN-Transducer for online speech recognition
This repository provides an implementation of an online speech recognition system based on the RNN Transducer, targeting researchers and developers interested in real-time speech-to-text applications. It offers a pre-trained model that can caption YouTube live streams faster than YouTube's own live captioning on modest hardware, and it includes tools for visualizing the alignment between audio and transcribed text.
How It Works
The system leverages the RNN Transducer (RNN-T) architecture, a sequence-to-sequence model well suited to online decoding because it processes audio and predicts text incrementally. The model is trained on over 2,000 hours of diverse speech data, with a focus on low latency for real-time applications. The repository also supports exporting the RNN-T model to ONNX and OpenVINO formats for optimized inference.
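The sketch below illustrates the incremental decoding idea in the abstract: an RNN-T emits labels frame by frame while carrying recurrent state across audio chunks. All module names, dimensions, and the greedy strategy are assumptions chosen for illustration; they do not reflect this repository's actual architecture or API.

```python
import torch
import torch.nn as nn

BLANK = 0          # blank token id (assumption)
VOCAB_SIZE = 32    # toy vocabulary size (assumption)
FEAT_DIM = 80      # e.g. log-mel features per frame (assumption)
HIDDEN = 128


class TinyRNNT(nn.Module):
    """Toy RNN-T with the three standard components: encoder, predictor, joint."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True)   # acoustic encoder
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)                # label embedding
        self.predictor = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)   # label predictor
        self.joint = nn.Linear(2 * HIDDEN, VOCAB_SIZE)               # joint network

    @torch.no_grad()
    def decode_chunk(self, frames, enc_state, pred_state, last_token, max_symbols=5):
        """Greedy-decode one chunk of audio frames, carrying state between chunks."""
        enc_out, enc_state = self.encoder(frames, enc_state)         # (1, T, H)
        hypothesis = []
        for t in range(enc_out.size(1)):                             # frame by frame
            for _ in range(max_symbols):                             # cap symbols per frame
                emb = self.embed(last_token).unsqueeze(1)            # (1, 1, H)
                pred_out, new_state = self.predictor(emb, pred_state)
                logits = self.joint(
                    torch.cat([enc_out[:, t], pred_out[:, 0]], dim=-1)
                )
                token = int(logits.argmax(dim=-1))
                if token == BLANK:
                    break                                            # advance to next frame
                hypothesis.append(token)                             # emit a non-blank label
                last_token = torch.tensor([token])
                pred_state = new_state
        return hypothesis, enc_state, pred_state, last_token


model = TinyRNNT().eval()
enc_state, pred_state, last_token = None, None, torch.tensor([BLANK])
for _ in range(3):                                   # pretend three audio chunks arrive over time
    chunk = torch.randn(1, 20, FEAT_DIM)             # 20 frames of dummy features
    hyp, enc_state, pred_state, last_token = model.decode_chunk(
        chunk, enc_state, pred_state, last_token
    )
    print(hyp)
```

Because the encoder and predictor states persist across calls, decoding can start before the full utterance is available, which is what makes RNN-T attractive for online use.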
Quick Start & Requirements
Run python youtube_live.py with the provided flag files and URLs for live YouTube captioning. Configuration options are defined in rnnt/args.py and supplied via flag files.
Maintenance & Community
The project is described as a joint collaboration; further community engagement channels are not explicitly listed in the README. The repository was last updated roughly four years ago and appears inactive.
Licensing & Compatibility
The repository's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification.
Limitations & Caveats
The project notes that its best model achieves a 16.3% WER on LibriSpeech test-clean, which is higher than common baselines. Training RNN-T models is resource-intensive, requiring significant compute power and storage. PyTorch's DataParallel was noted as broken in version 1.4.0, necessitating version 1.5.0 or later. OpenVINO inference was observed to be slower than PyTorch and ONNX Runtime.
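As a minimal illustration of the DataParallel caveat, a training script could refuse to run on the broken release before wrapping the model. The guard below is a hypothetical sketch, not code from this repository.

```python
# Hypothetical guard for the DataParallel caveat (not part of this repository).
import torch
import torch.nn as nn

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) < (1, 5):
    raise RuntimeError(
        "nn.DataParallel is reported broken in PyTorch 1.4.0; "
        "upgrade to 1.5.0 or later before multi-GPU training."
    )

model = nn.Linear(10, 10)                 # stand-in for the real RNN-T model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # replicate across available GPUs
```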