Optimized speech-to-text pipeline for Whisper models
WhisperS2T is an optimized speech-to-text pipeline designed to accelerate Whisper model inference. It targets researchers and developers needing faster, more accurate transcriptions, offering significant speedups over existing implementations.
How It Works
WhisperS2T achieves its speed by supporting multiple inference backends, including CTranslate2 and TensorRT-LLM, and by implementing pipeline-level optimizations. These include intelligent batching of audio segments, asynchronous loading of large files, and heuristics to reduce hallucinations. The design prioritizes efficient data flow and processing, leading to notable performance gains without sacrificing accuracy.
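As an illustration of backend selection and batched transcription, here is a minimal sketch assuming the `load_model` and `transcribe_with_vad` entry points from the upstream repository; exact argument names should be verified against the current README.

```python
import whisper_s2t

# The backend is chosen at load time; 'CTranslate2' and 'TensorRT-LLM'
# are the accelerated options discussed above.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

# Multiple files go through a single call; the pipeline batches their
# VAD-segmented chunks together (per-file metadata lists assumed here).
files = ["audio/meeting.wav", "audio/interview.wav"]
out = model.transcribe_with_vad(
    files,
    lang_codes=["en", "en"],
    tasks=["transcribe", "transcribe"],
    initial_prompts=[None, None],
    batch_size=32,  # larger batches improve GPU utilization
)

print(out[0][0])  # first utterance of the first file
```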
Quick Start & Requirements
Install from PyPI:

```bash
pip install -U whisper-s2t
```

or directly from GitHub:

```bash
pip install -U git+https://github.com/shashikg/WhisperS2T.git
```

System dependencies: libsndfile1 and ffmpeg. The TensorRT-LLM backend additionally requires TensorRT and TensorRT-LLM (installed via install_tensorrt.sh or the official instructions). CUDA is required for GPU acceleration.

A prebuilt Docker image is available:

```bash
docker pull shashikg/whisper_s2t:dev-trtllm
```

Alternatively, build the image from source with docker build.
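After installation, a quick smoke test might look like the following, reusing the (assumed) entry points from the sketch above:

```python
import whisper_s2t

# Load a small model for a fast end-to-end check.
model = whisper_s2t.load_model(model_identifier="tiny", backend="CTranslate2")

out = model.transcribe_with_vad(
    ["sample.wav"],  # any short local audio file
    lang_codes=["en"],
    tasks=["transcribe"],
    initial_prompts=[None],
    batch_size=16,
)
print(out[0][0])
```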
Maintenance & Community
The project is actively developed, with recent updates adding Docker images, transcript exporters, and TensorRT-LLM support. Future plans include a dedicated server codebase and more in-depth documentation.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Initial runs may exhibit slower inference due to JIT tracing of the VAD model. Some advanced features, such as word alignment and dynamic time-length support, are specific to the CTranslate2 backend. Benchmarks were conducted with `without_timestamps=True`, which may affect the reported Word Error Rate (WER).
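Because the first call pays the VAD tracing cost, any timing comparison should discard a warm-up run. A hypothetical benchmarking sketch, using the same assumed entry points as above:

```python
import time
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")
args = dict(lang_codes=["en"], tasks=["transcribe"],
            initial_prompts=[None], batch_size=16)

# Warm-up: the first call triggers one-time JIT tracing of the VAD model.
model.transcribe_with_vad(["sample.wav"], **args)

# Subsequent calls reflect steady-state throughput.
t0 = time.perf_counter()
model.transcribe_with_vad(["sample.wav"], **args)
print(f"steady-state run: {time.perf_counter() - t0:.2f}s")
```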