attention-lvcsr  by rizar

End-to-end attention-based speech recognition

Created 10 years ago
262 stars


Project Summary

This project provides an end-to-end attention-based large vocabulary speech recognition system. It is the reference implementation for the papers "End-to-End Attention-based Large Vocabulary Speech Recognition" and "Task Loss Estimation for Sequence Prediction", and is aimed at speech recognition researchers and practitioners who want to understand or reproduce those results. Its main value is as a working, albeit dated, implementation of an attention-based ASR system.

How It Works

The system uses an attention mechanism for speech recognition: when generating each output token, the model focuses on the relevant part of the input audio sequence. This approach, detailed in the referenced papers, is an alternative to traditional frame-synchronous models because it maps a variable-length audio sequence directly to a variable-length text sequence.
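To make the idea of "focusing on relevant parts of the input" concrete, here is a minimal sketch of one step of generic additive (content-based) attention in NumPy. It is not taken from the repository: the function names, the parameter names (W_enc, W_dec, v), and all dimensions are assumptions chosen for illustration, and the referenced papers may use a more elaborate scoring function.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(encoded, decoder_state, W_enc, W_dec, v):
    """One step of additive attention (illustrative, not the repo's code).

    encoded:       (T, d_enc) encoder outputs, one row per audio frame
    decoder_state: (d_dec,)   current decoder state
    W_enc:         (d_enc, d_att), W_dec: (d_dec, d_att), v: (d_att,)
    Returns the attention weights (T,) and the context vector (d_enc,).
    """
    # Score every frame against the current decoder state.
    scores = np.tanh(encoded @ W_enc + decoder_state @ W_dec) @ v
    # Normalize the scores into a distribution over frames.
    weights = softmax(scores)
    # The context is the weighted average of the encoder outputs.
    context = weights @ encoded
    return weights, context


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_enc, d_dec, d_att = 50, 8, 6, 4  # hypothetical sizes
    weights, context = attend(
        rng.normal(size=(T, d_enc)),
        rng.normal(size=d_dec),
        rng.normal(size=(d_enc, d_att)),
        rng.normal(size=(d_dec, d_att)),
        rng.normal(size=d_att),
    )
    print("most attended frame:", int(np.argmax(weights)))
    print("context shape:", context.shape)
```

A decoder would call a step like this once per output token, so the number of output tokens is not tied to the number of input frames.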

Quick Start & Requirements

  • Installation: Compile Kaldi with the --shared and --use-cuda=no options, install the Python packages pykwalify, toposort, pyyaml, numpy, pandas, and pyfst, then install kaldi-python with python setup.py install.
  • Prerequisites: Kaldi, OpenFst, and the Python packages listed above. The Wall Street Journal (WSJ) dataset (LDC93S6B and LDC94S13B) is required to replicate the results.
  • Dependencies: The codebase bundles custom modified subtrees of Theano, Blocks, and Fuel.
  • Setup: See exp/wsj for instructions on replicating the results on the WSJ dataset.
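Before attempting the WSJ recipe, it can help to confirm that the Python packages listed above are importable. The following is a small sketch of such a check, not something shipped with the project. The module names in ASSUMED_MODULES are assumptions: a distribution name and its import name can differ (pyyaml imports as yaml, and the names used here for pyfst and kaldi-python are guesses), so adjust them to match your environment.

```python
import importlib.util

# Package name -> assumed top-level import name (assumptions, not verified
# against the repository).
ASSUMED_MODULES = {
    "pykwalify": "pykwalify",
    "toposort": "toposort",
    "pyyaml": "yaml",
    "numpy": "numpy",
    "pandas": "pandas",
    "pyfst": "fst",              # assumed import name
    "kaldi-python": "kaldi_io",  # assumed import name
}


def check_modules(modules):
    """Return {package: True/False} depending on whether the module is found.

    Uses find_spec so nothing is actually imported.
    """
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, found in check_modules(ASSUMED_MODULES).items():
        print(f"{package:14s} {'found' if found else 'MISSING'}")
```

This only checks the Python side; Kaldi and OpenFst still have to be built and installed separately as described above.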

Highlighted Details

  • Reference implementation for "End-to-End Attention-based Large Vocabulary Speech Recognition" and "Task Loss Estimation for Sequence Prediction" papers.
  • Includes custom modified subtrees for Theano, Blocks, Fuel, picklable-itertools, and Blocks-extras.

Maintenance & Community

This codebase is no longer maintained and is built on outdated technologies (Theano, Blocks). Users are advised to look at more modern implementations.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The MIT license permits commercial use, but the project's reliance on outdated technologies and its lack of maintenance may make it difficult to run on modern systems.

Limitations & Caveats

The project states that it is no longer maintained because it relies on outdated technologies such as Theano and Blocks, and it recommends that users seek more modern alternatives. This significantly limits its practical use for current ASR development.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
