pika by tencent-ailab

Speech processing toolkit for end-to-end recognition

Created 4 years ago
343 stars

Top 80.6% on SourcePulse

Project Summary

PIKA is a lightweight, PyTorch-based speech processing toolkit that leverages (Py)Kaldi for efficient data handling and feature extraction, with a primary focus on end-to-end speech recognition. It is designed for researchers and developers working with speech data who need a flexible, performant toolkit. PIKA offers on-the-fly data augmentation, multiple model architectures (TDNN, Transformer), RNNT training with Minimum Bayes Risk (MBR) optimization, and decoding with external N-gram FSTs for shallow fusion.

How It Works

PIKA integrates PyTorch for its deep learning capabilities and Kaldi for robust data preparation and feature extraction. This hybrid approach allows data loading and augmentation to happen efficiently inside the training pipeline. The toolkit supports recurrent neural network transducer (RNNT) models, enabling end-to-end training and decoding. It also incorporates blockwise model-update filtering (BMUF) for distributed training and can integrate Listen, Attend and Spell (LAS) models for forward and backward rescoring of RNNT outputs, improving recognition accuracy.

Quick Start & Requirements

  • Installation: Recommended via Anaconda. Ensure PyTorch (>= 1.0.0 recommended) and PyKaldi are installed; PyKaldi should be built with ninja for optimal performance. CUDA-Warp RNN-Transducer is required for the RNNT loss module, and the remaining dependencies are listed in requirements.txt (see the installation sketch after this list).
  • Data Preparation: Training data requires wav.scp and label.txt files. label.txt maps utterance IDs to sequences of one-based indexed labels, with 0 reserved for the blank symbol (example layout below).
  • Training & Decoding: Scripts for training and decoding live in the egs directory: egs/train_transducer_bmuf_otfaug.sh for data preparation and RNNT training, egs/train_transducer_mbr_bmuf_otfaug.sh for MBR training, and egs/train_las_rescorer_bmuf_otfaug.sh for training LAS rescorers. Decoding is handled by egs/eval_transducer.sh (see the workflow sketch below).
  • Resources: Hyperparameters are tuned for large-scale (60k+ hours) training; users may need to re-tune for optimal performance on different datasets.
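
A minimal installation sketch, assuming an Anaconda setup; the conda channels, Python/PyTorch versions, and the warp_rnnt package name come from the upstream projects and may differ from what this repository pins:

    # create and activate a fresh environment (name and Python version are arbitrary)
    conda create -n pika python=3.8
    conda activate pika

    # PyTorch and PyKaldi; PyKaldi can also be built from source with ninja
    conda install pytorch -c pytorch
    conda install -c pykaldi pykaldi

    # RNNT loss module (CUDA-Warp RNN-Transducer)
    pip install warp_rnnt

    # remaining dependencies
    pip install -r requirements.txt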
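
The two training lists are plain text, one utterance per line, keyed by utterance ID. The entries below are a hypothetical illustration (paths and label indices are made up):

    # wav.scp: <utterance-id> <path-to-audio>
    cat > wav.scp <<'EOF'
    utt0001 /data/corpus/wav/utt0001.wav
    utt0002 /data/corpus/wav/utt0002.wav
    EOF

    # label.txt: <utterance-id> followed by one-based label indices
    # (0 is reserved for the blank symbol and never appears in the labels)
    cat > label.txt <<'EOF'
    utt0001 23 5 117 42
    utt0002 8 64 3
    EOF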
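
The egs scripts are invoked directly; data paths, model directories, and GPU settings are typically configured by editing variables inside each script before running it. The sequence below is a sketch of a typical workflow, not an exact command reference:

    # RNNT training with on-the-fly augmentation
    bash egs/train_transducer_bmuf_otfaug.sh

    # optional: MBR training and LAS rescorer training
    bash egs/train_transducer_mbr_bmuf_otfaug.sh
    bash egs/train_las_rescorer_bmuf_otfaug.sh

    # decoding and evaluation
    bash egs/eval_transducer.sh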

Highlighted Details

  • Supports on-the-fly data augmentation and feature extraction.
  • Implements RNNT training with Minimum Bayes Risk (MBR) and BMUF for distributed training.
  • Enables forward and backward LAS rescoring for RNNT models.
  • Facilitates RNNT decoding with external N-gram FSTs for shallow fusion.

Maintenance & Community

  • This is not an officially supported Tencent product.
  • No specific community links (Discord, Slack, etc.) or notable contributors are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Users should exercise caution regarding usage and distribution.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Hyper-parameters are optimized for large-scale Mandarin speech data and may require significant tuning for other languages or datasets. The WER/CER scoring script is specific to Mandarin, necessitating modifications for other languages.

Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
