speech-to-text-wavenet by buriburisuri

Speech recognition using WaveNet in TensorFlow

created 8 years ago
3,983 stars

Top 12.5% on sourcepulse

Project Summary

This repository provides an end-to-end English speech recognition system based on DeepMind's WaveNet architecture, implemented in TensorFlow. It targets researchers and developers interested in replicating or extending WaveNet for speech-to-text tasks, offering a practical implementation that addresses some of the original paper's omissions.

How It Works

The system uses a dilated convolutional neural network, adapted from DeepMind's WaveNet, to process speech audio. Where the original WaveNet is best known for speech synthesis, this implementation targets speech recognition. Key modifications include using the VCTK dataset, replacing the mean-pooling layer with dilated convolutions for down-sampling (which keeps GPU memory use manageable), and training end to end with a CTC loss on sentence-level labels instead of phoneme classification.
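To make the term concrete, the following is a minimal pure-Python sketch of a one-dimensional dilated convolution and of how the receptive field grows when such layers are stacked. It is an illustration only: the function names, kernel size, and dilation schedule are hypothetical and are not taken from the repository.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form).

    Each output sample combines len(kernel) input samples that are
    `dilation` steps apart, so the layer sees a wider context without
    adding weights.
    """
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical example values, chosen only for illustration.
outputs = dilated_conv1d([1, 2, 3, 4, 5], kernel=[1, 1], dilation=2)
context = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is the usual motivation for this kind of stack.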

Quick Start & Requirements

  • Install: Dependency versions matter. Pin tensorflow==1.0.0, sugartensor==1.0.0.2, librosa==0.5.0, and scikits.audiolab==0.11.0 exactly; pandas must be 0.19.2 or later.
  • Prerequisites: FFmpeg and SoX are required for audio preprocessing. An SSD is recommended for faster data loading.
  • Training: Training for 50 epochs on 3 Nvidia 1080 GPUs took 40 hours. The batch size can be reduced to 4 if out-of-memory errors occur.
  • Preprocessing: SPH files are converted to WAV, and the WAV files are then converted to MFCC features.
  • Links: Docker support is available (see docker README.md).
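Because the exact pins are easy to get wrong in an existing environment, a small check like the one below may help before training. This is a sketch using only the Python standard library (`importlib.metadata`, available in Python 3.8+); the helper name `find_mismatches` is hypothetical and is not part of the repository.

```python
from importlib import metadata

# Exact pins as listed above. pandas is left out because it is a
# minimum version (>=0.19.2) rather than an exact pin.
PINNED = {
    "tensorflow": "1.0.0",
    "sugartensor": "1.0.0.2",
    "librosa": "0.5.0",
    "scikits.audiolab": "0.11.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every pin not met."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dictionary when every pinned package is installed at the listed version. The `get_version` parameter exists only so the lookup can be replaced in tests.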

Highlighted Details

  • Implemented end-to-end sentence-level English speech recognition.
  • Uses VCTK, LibriSpeech, and TEDLIUM release 2 datasets, totaling 240,612 sentences for training.
  • Achieved CTC losses of 69.94, 66.83, and 77.31 on the training, validation, and test sets respectively at epoch 40.
  • Provides a recognize.py script that transcribes speech WAV files to text using a pre-trained model.

Maintenance & Community

  • Developed by Namju Kim and Kyubyong Park at KakaoBrain Corp.
  • The README links to related repositories by the same authors, including WaveNet implementations for speech synthesis.

Licensing & Compatibility

  • The README does not state a license, although the code is publicly available on GitHub.

Limitations & Caveats

The implementation lacks language model integration, leading to potential misspellings, incorrect capitalization, and missing punctuation in the output. The original paper's details were not fully replicated due to implementation challenges and resource constraints.
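The effect of having no language model is easier to see with a sketch of CTC best-path (greedy) decoding, where the text is read directly off the per-frame predictions. This is an illustration under stated assumptions: the alphabet, the blank index, and the function names are hypothetical, and the repository's own decoder may differ (for example, it may use beam search).

```python
import string

# Hypothetical label set: index 0 is the CTC blank, index 1 is a space,
# and the remaining indices are lower-case letters.
ALPHABET = ["<blank>", " "] + list(string.ascii_lowercase)


def ctc_best_path(frame_labels, blank=0):
    """Collapse repeated labels, then drop blanks (CTC best-path decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def to_text(labels):
    """Map label indices back to characters."""
    return "".join(ALPHABET[i] for i in labels)


# Per-frame argmax labels for "ss_ee_e", where "_" stands for the blank.
frames = [0 if c == "_" else ALPHABET.index(c) for c in "ss_ee_e"]
text = to_text(ctc_best_path(frames))
```

Nothing in this step checks the result against a vocabulary or grammar, so any spelling the acoustic model produces passes through unchanged, and a label set without upper-case letters or punctuation cannot produce them.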

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 11 stars in the last 90 days
