Speech recognition using WaveNet in TensorFlow
This repository provides an end-to-end English speech recognition system based on DeepMind's WaveNet architecture, implemented in TensorFlow. It targets researchers and developers interested in replicating or extending WaveNet for speech-to-text tasks, offering a practical implementation that addresses some of the original paper's omissions.
How It Works
The system uses a dilated convolutional neural network, adapted from DeepMind's WaveNet, to process speech audio. WaveNet is best known as a speech synthesis model; this implementation applies the architecture to speech recognition. It differs from the paper's setup in three ways: it trains on the VCTK dataset, it replaces the mean-pooling layer with dilated convolutions for down-sampling to keep GPU memory manageable, and it uses a CTC loss for end-to-end training with sentence-level labels rather than phoneme classification.
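Two of the ideas above can be illustrated with a minimal, framework-free sketch: how stacking dilated convolutions widens the context each output frame sees, and how a CTC-trained model's per-frame outputs collapse into text. This is not the repository's code. The function names, the dilation schedule, the kernel size, and the toy alphabet are all assumptions chosen for illustration, and the decoder shown is plain best-path (greedy) decoding.

```python
from itertools import groupby


def receptive_field(kernel_size, dilations):
    """Input frames visible to one output frame of a stack of
    stride-1 dilated 1-D convolutions (one layer per dilation)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding: take the top label per frame,
    merge consecutive repeats, then drop the blank label."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(alphabet[label] for label in collapsed if label != blank)


# Hypothetical dilation schedule and kernel size (not taken from the repo).
DILATIONS = [1, 2, 4, 8, 16]
FIELD = receptive_field(kernel_size=2, dilations=DILATIONS)

# Toy alphabet: index 0 is the CTC blank, shown here as "_".
ALPHABET = ["_", "a", "c", "t"]

# Made-up per-frame scores for six frames: c, c, blank, a, t, t.
FRAME_SCORES = [
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.20, 0.60, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.20, 0.10, 0.10, 0.60],
]
TEXT = ctc_greedy_decode(FRAME_SCORES, ALPHABET)

print(FIELD)  # 32
print(TEXT)   # cat
```

Because the blank label separates genuine repeats, CTC can align a whole sentence to the audio without frame-level or phoneme-level labels, which is why sentence-level transcripts such as those in VCTK are sufficient for training.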
Quick Start & Requirements
Requires tensorflow==1.0.0, sugartensor==1.0.0.2, pandas>=0.19.2, librosa==0.5.0, scikits.audiolab==0.11.0.
Highlighted Details
recognize.py: script for transforming speech wave files to text using a pre-trained model.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The implementation does not integrate a language model, so its output may contain misspellings, incorrect capitalization, and missing punctuation. It also does not fully replicate the original paper, owing to implementation challenges and resource constraints.