CAT by thu-spmi

ASR toolkit for data-efficient end-to-end speech recognition

Created 6 years ago

362 stars

Top 77.9% on SourcePulse

Project Summary

CAT is a toolkit for data-efficient end-to-end Automatic Speech Recognition (ASR), targeting researchers and practitioners seeking to combine the benefits of hybrid and end-to-end ASR approaches. It offers a complete workflow for Conditional Random Field (CRF)-based ASR, aiming for improved performance with less data.

How It Works

CAT uses a CRF-based framework whose state topology is inspired by Connectionist Temporal Classification (CTC). The approach combines globally normalized sequence modeling with discriminative training, bridging the gap between modular hybrid systems and unified end-to-end neural networks. By balancing modularity against joint optimization, it aims for data efficiency and potentially lower latency.
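
As a rough sketch of what "CRF with a CTC topology" means, the CTC-CRF literature writes the model in the form below. The notation is introduced here for illustration and is not taken from the README: x is the input feature sequence of length T, l the label sequence, π a frame-level path, B the CTC collapse mapping (merge repeats, drop blanks), and p_LM a label-level language model assumed to supply the prior term.

```latex
p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x),
\qquad
p(\pi \mid x) = \frac{\exp \phi(\pi, x)}{\sum_{\pi'} \exp \phi(\pi', x)},
\qquad
\phi(\pi, x) = \sum_{t=1}^{T} \log p(\pi_t \mid x) + \log p_{\mathrm{LM}}\bigl(\mathcal{B}(\pi)\bigr)
```

The denominator sums over all paths, which is what makes the model globally normalized. Plain CTC instead normalizes each frame independently and has no language-model term in the potential.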

Quick Start & Requirements

Install: run git clone https://github.com/thu-spmi/CAT.git && cd CAT, then ./install.sh.
Dependencies: PyTorch >= 1.9.0, a CUDA-compatible device, the NVIDIA driver, and the CUDA libraries. Kaldi is optional but recommended for CTC-CRF training and data preparation. Torchaudio can be used as an alternative for feature extraction.
Further guidance is available in the TEMPLATE and data.sh files.
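
Before running the installer, it can help to confirm that the environment meets the stated requirements. The following is a minimal sketch, not part of CAT: the helper names parse_version, meets_minimum, and check_environment are hypothetical, and only the 1.9.0 threshold comes from the dependency list above.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a string such as '1.10.0+cu113' into (1, 10, 0).

    Local build suffixes after '+' and non-numeric tails are ignored.
    """
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: str = "1.9.0") -> bool:
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return parse_version(version) >= parse_version(minimum)


def check_environment() -> dict:
    """Report whether PyTorch is importable, recent enough, and sees a GPU."""
    report = {"torch_installed": False, "torch_ok": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_ok"] = meets_minimum(torch.__version__)
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```

If torch_ok or cuda_available comes back False, the install script is unlikely to produce a working CTC-CRF build, since the loss is implemented in CUDA.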

Highlighted Details

Full-fledged CUDA/C/C++ implementation of the CTC-CRF loss function, with PyTorch bindings.
Supports one-stop training and inference for CTC, CTC-CRF, RNN-T, and language models (LM).
Flexible configuration via JSON files.
Scalable and extensible for large datasets and custom models.
Achieves competitive performance on various benchmarks (e.g., 2.77% WER on WSJ eval92).
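
For readers unfamiliar with the metric quoted above, word error rate (WER) is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a generic illustration of that definition and is not CAT's scoring code; the function names and sample sentences are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent: total word errors over total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "hello world"]
    hyps = ["the cat sat on mat", "hello word"]
    # One deletion plus one substitution over eight reference words.
    print(f"WER: {word_error_rate(refs, hyps):.2f}%")
```

On this scale, a 2.77% WER corresponds to fewer than three word errors per hundred reference words.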

Maintenance & Community

The project is associated with the Speech Processing and Machine Intelligence (SPMI) group at Tsinghua University. Key publications are cited, indicating academic backing.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The toolkit requires a CUDA-enabled NVIDIA GPU. While Kaldi is optional, its absence might limit certain advanced CTC-CRF training functionalities. The licensing status requires clarification for commercial applications.

Explore Similar Projects

OLMoASR by allenai

segformer-pytorch by bubbliiiing

attention-lvcsr by rizar

TensorflowASR by Z-yq

espresso by freewym

athena by athena-team

STT by coqui-ai

lightseq by bytedance

icefall by k2-fsa

speech-to-text-wavenet by buriburisuri

TTS by mozilla

tensor2tensor by tensorflow