RNN-Transducer for online speech recognition
This repository provides an implementation of an online speech recognition system based on the RNN Transducer, targeting researchers and developers interested in real-time speech-to-text applications. It offers a pre-trained model that can caption YouTube live streams faster than YouTube's own live captioning on modest hardware, and it includes tools for visualizing the alignment between audio and transcribed text.
How It Works
The system leverages the RNN Transducer (RNN-T) architecture, a sequence-to-sequence model well suited to online decoding because it processes audio and predicts text incrementally. The model is trained on over 2,000 hours of diverse speech data, with a focus on low latency for real-time applications. The repository also supports exporting the RNN-T model to ONNX and OpenVINO formats for optimized inference.
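The sketch below illustrates the incremental decoding idea in the abstract: an RNN-T emits labels frame by frame while carrying recurrent state across audio chunks. All module names, dimensions, and the greedy strategy are assumptions chosen for illustration; they do not reflect this repository's actual architecture or API.

```python
import torch
import torch.nn as nn

BLANK = 0          # blank token id (assumption)
VOCAB_SIZE = 32    # toy vocabulary size (assumption)
FEAT_DIM = 80      # e.g. log-mel features per frame (assumption)
HIDDEN = 128


class TinyRNNT(nn.Module):
    """Toy RNN-T with the three standard components: encoder, predictor, joint."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True)   # acoustic encoder
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)                # label embedding
        self.predictor = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)   # label predictor
        self.joint = nn.Linear(2 * HIDDEN, VOCAB_SIZE)               # joint network

    @torch.no_grad()
    def decode_chunk(self, frames, enc_state, pred_state, last_token, max_symbols=5):
        """Greedy-decode one chunk of audio frames, carrying state between chunks."""
        enc_out, enc_state = self.encoder(frames, enc_state)         # (1, T, H)
        hypothesis = []
        for t in range(enc_out.size(1)):                             # frame by frame
            for _ in range(max_symbols):                             # cap symbols per frame
                emb = self.embed(last_token).unsqueeze(1)            # (1, 1, H)
                pred_out, new_state = self.predictor(emb, pred_state)
                logits = self.joint(
                    torch.cat([enc_out[:, t], pred_out[:, 0]], dim=-1)
                )
                token = int(logits.argmax(dim=-1))
                if token == BLANK:
                    break                                            # advance to next frame
                hypothesis.append(token)                             # emit a non-blank label
                last_token = torch.tensor([token])
                pred_state = new_state
        return hypothesis, enc_state, pred_state, last_token


model = TinyRNNT().eval()
enc_state, pred_state, last_token = None, None, torch.tensor([BLANK])
for _ in range(3):                                   # pretend three audio chunks arrive over time
    chunk = torch.randn(1, 20, FEAT_DIM)             # 20 frames of dummy features
    hyp, enc_state, pred_state, last_token = model.decode_chunk(
        chunk, enc_state, pred_state, last_token
    )
    print(hyp)
```

Because the encoder and predictor states persist across calls, decoding can start before the full utterance is available, which is what makes RNN-T attractive for online use.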
Quick Start & Requirements
Run python youtube_live.py with the provided flag files and URLs for live YouTube captioning. Configuration options are defined in rnnt/args.py and supplied via flag files.
Maintenance & Community
The project is described as a joint collaboration; further community engagement channels are not explicitly listed in the README. The repository was last updated roughly four years ago and appears inactive.
Licensing & Compatibility
The repository's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification.
Limitations & Caveats
The project notes that its best model achieves a 16.3% WER on LibriSpeech test-clean, which is higher than common baselines. Training RNN-T models is resource-intensive, requiring significant compute power and storage. PyTorch's DataParallel was noted as broken in version 1.4.0, necessitating version 1.5.0 or later. OpenVINO inference was observed to be slower than PyTorch and ONNX Runtime.
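As a minimal illustration of the DataParallel caveat, a training script could refuse to run on the broken release before wrapping the model. The guard below is a hypothetical sketch, not code from this repository.

```python
# Hypothetical guard for the DataParallel caveat (not part of this repository).
import torch
import torch.nn as nn

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
if (major, minor) < (1, 5):
    raise RuntimeError(
        "nn.DataParallel is reported broken in PyTorch 1.4.0; "
        "upgrade to 1.5.0 or later before multi-GPU training."
    )

model = nn.Linear(10, 10)                 # stand-in for the real RNN-T model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)        # replicate across available GPUs
```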