pika by tencent-ailab

Speech processing toolkit for end-to-end recognition

Created 4 years ago
343 stars

Top 80.6% on SourcePulse

Project Summary

PIKA is a lightweight, PyTorch-based speech processing toolkit that leverages (Py)Kaldi for efficient data handling and feature extraction, with a primary focus on end-to-end speech recognition. It is designed for researchers and developers working with speech data who need a flexible, performant toolkit. PIKA offers on-the-fly data augmentation, multiple model architectures (TDNN, Transformer), RNNT training with Minimum Bayes Risk (MBR) optimization, and decoding with external N-gram FSTs for shallow fusion.

How It Works

PIKA integrates PyTorch for its deep learning capabilities and Kaldi for robust data preparation and feature extraction. This hybrid approach allows data loading and augmentation to happen efficiently inside the training pipeline. The toolkit supports recurrent neural network transducer (RNNT) models, enabling end-to-end training and decoding. It also incorporates blockwise model-update filtering (BMUF) for distributed training and can integrate Listen, Attend and Spell (LAS) models for forward and backward rescoring of RNNT outputs, improving recognition accuracy.

Quick Start & Requirements

  • Installation: Recommended via Anaconda. Ensure PyTorch (>= 1.0.0 recommended) and PyKaldi are installed; PyKaldi should be built with ninja for optimal performance. CUDA-Warp RNN-Transducer is required for the RNNT loss module, and the remaining dependencies are listed in requirements.txt (see the installation sketch after this list).
  • Data Preparation: Training data requires wav.scp and label.txt files. label.txt maps utterance IDs to sequences of one-based indexed labels, with 0 reserved for the blank symbol (example layout below).
  • Training & Decoding: Scripts for training and decoding live in the egs directory: egs/train_transducer_bmuf_otfaug.sh for data preparation and RNNT training, egs/train_transducer_mbr_bmuf_otfaug.sh for MBR training, and egs/train_las_rescorer_bmuf_otfaug.sh for training LAS rescorers. Decoding is handled by egs/eval_transducer.sh (see the workflow sketch below).
  • Resources: Hyperparameters are tuned for large-scale (60k+ hours) training; users may need to re-tune for optimal performance on different datasets.
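
A minimal installation sketch, assuming an Anaconda setup; the conda channels, Python/PyTorch versions, and the warp_rnnt package name come from the upstream projects and may differ from what this repository pins:

    # create and activate a fresh environment (name and Python version are arbitrary)
    conda create -n pika python=3.8
    conda activate pika

    # PyTorch and PyKaldi; PyKaldi can also be built from source with ninja
    conda install pytorch -c pytorch
    conda install -c pykaldi pykaldi

    # RNNT loss module (CUDA-Warp RNN-Transducer)
    pip install warp_rnnt

    # remaining dependencies
    pip install -r requirements.txt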
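
The two training lists are plain text, one utterance per line, keyed by utterance ID. The entries below are a hypothetical illustration (paths and label indices are made up):

    # wav.scp: <utterance-id> <path-to-audio>
    cat > wav.scp <<'EOF'
    utt0001 /data/corpus/wav/utt0001.wav
    utt0002 /data/corpus/wav/utt0002.wav
    EOF

    # label.txt: <utterance-id> followed by one-based label indices
    # (0 is reserved for the blank symbol and never appears in the labels)
    cat > label.txt <<'EOF'
    utt0001 23 5 117 42
    utt0002 8 64 3
    EOF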
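
The egs scripts are invoked directly; data paths, model directories, and GPU settings are typically configured by editing variables inside each script before running it. The sequence below is a sketch of a typical workflow, not an exact command reference:

    # RNNT training with on-the-fly augmentation
    bash egs/train_transducer_bmuf_otfaug.sh

    # optional: MBR training and LAS rescorer training
    bash egs/train_transducer_mbr_bmuf_otfaug.sh
    bash egs/train_las_rescorer_bmuf_otfaug.sh

    # decoding and evaluation
    bash egs/eval_transducer.sh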

Highlighted Details

  • Supports on-the-fly data augmentation and feature extraction.
  • Implements RNNT training with Minimum Bayes Risk (MBR) and BMUF for distributed training.
  • Enables forward and backward LAS rescoring for RNNT models.
  • Facilitates RNNT decoding with external N-gram FSTs for shallow fusion.

Maintenance & Community

  • This is not an officially supported Tencent product.
  • No specific community links (Discord, Slack, etc.) or notable contributors are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Users should exercise caution regarding usage and distribution.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Hyper-parameters are optimized for large-scale Mandarin speech data and may require significant tuning for other languages or datasets. The WER/CER scoring script is specific to Mandarin, necessitating modifications for other languages.

Health Check

  • Last Commit: 4 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
