edgedict by theblackcat102

RNN-Transducer for online speech recognition

created 5 years ago
293 stars

Top 91.2% on sourcepulse

Project Summary

This repository provides an implementation of an online speech recognition system using RNN Transducer, targeting researchers and developers interested in real-time speech-to-text applications. It offers a pre-trained model capable of achieving faster-than-YouTube live captioning speeds on modest hardware and includes tools for visualizing the alignment between audio and transcribed text.

How It Works

The system leverages the RNN Transducer (RNN-T) architecture, a sequence-to-sequence model well-suited for online decoding due to its ability to process audio and predict text incrementally. The implementation is trained on over 2000 hours of diverse speech data, with a focus on achieving low latency for real-time applications. It supports exporting the RNN-T model to ONNX and OpenVINO formats for optimized inference.
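The online behaviour comes from the transducer's decoding loop: for each incoming audio frame, the encoder output is combined with the prediction network's state through a joint network, and tokens are emitted until a blank symbol advances decoding to the next frame. The sketch below illustrates that greedy streaming loop in PyTorch; the encoder, predictor, and joint callables and the blank index are placeholders for illustration, not this repository's actual API.

```python
import torch

BLANK = 0  # assumed blank token id (placeholder)

@torch.no_grad()
def greedy_streaming_decode(encoder, predictor, joint, frames, max_symbols_per_frame=10):
    """Illustrative RNN-T greedy decoding: consume frames one at a time and
    emit tokens until the joint network predicts blank for the current frame."""
    hyp = []                         # emitted token ids
    pred_state = None                # prediction-network hidden state
    enc_state = None                 # encoder hidden state, carried across frames
    last_token = torch.full((1, 1), BLANK, dtype=torch.long)  # start symbol

    for frame in frames:             # frame: (1, 1, feat_dim) acoustic features
        enc_out, enc_state = encoder(frame, enc_state)         # (1, 1, H_enc)
        for _ in range(max_symbols_per_frame):                 # cap emissions per frame
            pred_out, new_pred_state = predictor(last_token, pred_state)
            logits = joint(enc_out, pred_out)                  # (1, 1, 1, vocab_size)
            token = int(logits.argmax(dim=-1).item())
            if token == BLANK:                                 # move on to the next frame
                break
            hyp.append(token)                                  # emit and keep decoding
            last_token = torch.tensor([[token]], dtype=torch.long)
            pred_state = new_pred_state
    return hyp
```

Because tokens are committed as soon as each frame is consumed, latency is bounded by the frame rate and the encoder's context rather than by the length of the utterance.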

Quick Start & Requirements

  • Install: Follow instructions for installing PyTorch, torchaudio, Apex, and warprnnt-pytorch.
  • Prerequisites: Python, CUDA, cuDNN, and PyTorch 1.5.0 or later (DataParallel is broken in 1.4.0).
  • Data: Datasets such as Common Voice, YouTube Caption, LibriSpeech, and TEDLIUM are supported; preprocessing scripts are provided (a typical feature-extraction step is sketched after this list).
  • Demo: Run python youtube_live.py with provided flag files and URLs for live YouTube captioning.
  • Docs: Configuration details are in rnnt/args.py and flag files.
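For orientation, the sketch below shows the kind of log-mel feature extraction such preprocessing typically performs with torchaudio; the 16 kHz sample rate, window, hop, and 80-mel settings are assumptions for illustration and may differ from the repository's actual configuration.

```python
import torch
import torchaudio

# Illustrative log-mel front end; parameter values are assumptions, not the
# repository's exact preprocessing configuration.
_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)

def wav_to_features(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)                 # (channels, samples)
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    feats = _mel(waveform)                               # (channels, n_mels, frames)
    feats = torch.log(feats + 1e-6)                      # log compression
    return feats.mean(dim=0).transpose(0, 1)             # (frames, n_mels), mono
```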

Highlighted Details

  • Demonstrates the online (streaming) decoding capability of the RNN Transducer.
  • Models are exportable to ONNX and OpenVINO for optimized inference (a hedged export sketch follows this list).
  • Produces live captions 4-10 seconds ahead of YouTube's own captions on a dual-core Intel i5.
  • Visualizes audio-text alignment, similar to Graves et al. (2013).
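As a rough illustration of the ONNX path, the sketch below exports a small stand-in encoder with torch.onnx.export, marking the frame axis as dynamic so variable-length inputs are accepted at inference time. The real project exports the full RNN-T (encoder, prediction, and joint networks); the module and tensor names here are assumptions.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the real audio encoder, used only so the export runs end to end."""
    def __init__(self, feat_dim: int = 80, hidden: int = 320):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        return out                         # (batch, frames, hidden)

encoder = TinyEncoder().eval()
dummy = torch.randn(1, 16, 80)             # one dummy batch of 16 feature frames
torch.onnx.export(
    encoder, dummy, "encoder.onnx",
    input_names=["features"],
    output_names=["encoded"],
    dynamic_axes={"features": {1: "frames"}, "encoded": {1: "frames"}},
    opset_version=11,
)
```

The resulting .onnx file can then be converted to OpenVINO's intermediate representation with OpenVINO's model conversion tooling.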

Maintenance & Community

The project is a joint effort between its contributors. No further community engagement channels are explicitly listed in the README.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the provided README. Compatibility for commercial use or closed-source linking would require clarification.

Limitations & Caveats

The project notes that its best model achieves a 16.3% WER on LibriSpeech test-clean, which is higher than common baselines. Training RNN-T models is resource-intensive, requiring significant compute power and storage. PyTorch's DataParallel was noted as broken in version 1.4.0, necessitating version 1.5.0 or later. OpenVINO inference was observed to be slower than both PyTorch and ONNX Runtime.

Health Check

  • Last commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
