WhisperS2T by shashikg

Optimized speech-to-text pipeline for Whisper models

created 1 year ago
445 stars

Top 68.5% on sourcepulse

Project Summary

WhisperS2T is an optimized speech-to-text pipeline designed to accelerate Whisper model inference. It targets researchers and developers needing faster, more accurate transcriptions, offering significant speedups over existing implementations.

How It Works

WhisperS2T achieves its speed by supporting multiple inference backends, including CTranslate2 and TensorRT-LLM, and by implementing pipeline-level optimizations. These include intelligent batching of audio segments, asynchronous loading of large files, and heuristics to reduce hallucinations. The design prioritizes efficient data flow and processing, leading to notable performance gains without sacrificing accuracy.
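The batching idea above can be sketched in a few lines. The snippet below is an illustrative sketch, not WhisperS2T's actual implementation, and the segment durations are made up: sorting VAD-detected segments by duration before grouping them means each batch only pads to its own longest member, which is the core of why intelligent batching saves compute.

```python
# Illustrative sketch of length-sorted batching (not WhisperS2T's actual code).
# Grouping segments of similar duration minimizes padding inside each batch.

def make_batches(durations, batch_size):
    """Sort segment durations, group into batches, and return
    (original_index, duration) pairs per batch."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append([(i, durations[i]) for i in idx])
    return batches

# Hypothetical VAD segment durations in seconds.
segments = [12.0, 1.5, 29.9, 3.2, 8.7, 2.1]
for batch in make_batches(segments, batch_size=3):
    # Padding per batch is bounded by the longest segment in that batch.
    print([d for _, d in batch], "padded to", max(d for _, d in batch))
```

With naive in-order batching, the first batch here would pad a 1.5 s segment up to 29.9 s; after sorting, short segments are padded only to 3.2 s.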

Quick Start & Requirements

  • Install: pip install -U whisper-s2t or pip install -U git+https://github.com/shashikg/WhisperS2T.git
  • Prerequisites: libsndfile1, ffmpeg. The TensorRT-LLM backend additionally requires installing TensorRT and TensorRT-LLM (via install_tensorrt.sh or the official instructions). A CUDA-capable GPU is required for GPU acceleration.
  • Docker: Prebuilt images available: docker pull shashikg/whisper_s2t:dev-trtllm. Build from source with docker build.
  • Docs: Google Colab notebooks are provided for quickstart.
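Collecting the steps above, a typical setup on a Debian/Ubuntu machine might look like the following (package names and commands are taken from the prerequisites listed above; adjust for your distribution):

```shell
# System prerequisites (Debian/Ubuntu; other distros use their own package manager).
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg

# Install the release from PyPI...
pip install -U whisper-s2t

# ...or the latest development version from GitHub:
# pip install -U git+https://github.com/shashikg/WhisperS2T.git

# Alternatively, pull the prebuilt Docker image with the TensorRT-LLM backend:
# docker pull shashikg/whisper_s2t:dev-trtllm
```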

Highlighted Details

  • Claims 2.3X speedup over WhisperX and 3X over HuggingFace Pipeline with FlashAttention 2.
  • Supports multiple backends: Original OpenAI, HuggingFace (with FlashAttention2), CTranslate2, and TensorRT-LLM.
  • Integrates custom Voice Activity Detection (VAD) models.
  • Offers batching for multiple languages/tasks and experimental dynamic time length support (CTranslate2 backend).
  • Includes heuristics to reduce text hallucinations (some specific to CTranslate2).
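The project does not detail its hallucination heuristics here, but a common symptom such heuristics target in Whisper pipelines is decoder looping, where a short phrase repeats many times. The sketch below is a hypothetical repetition check for illustration only, not WhisperS2T's actual heuristic:

```python
# Hypothetical repetition filter for transcript segments (illustrative only;
# not WhisperS2T's actual hallucination heuristic).

def looks_hallucinated(text, max_repeat=3):
    """Flag text where some word n-gram repeats back-to-back more than
    max_repeat times, a common symptom of decoder looping."""
    words = text.lower().split()
    for n in range(1, 5):                       # check 1- to 4-gram loops
        for start in range(len(words) - n):
            gram = words[start:start + n]
            repeats = 1
            pos = start + n
            while words[pos:pos + n] == gram:   # count consecutive repeats
                repeats += 1
                pos += n
            if repeats > max_repeat:
                return True
    return False

print(looks_hallucinated("thank you thank you thank you thank you"))  # True
print(looks_hallucinated("the meeting starts at nine tomorrow"))      # False
```

A flagged segment could then be dropped or re-decoded with different parameters; production heuristics are typically more nuanced than this.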

Maintenance & Community

The project is actively developed, with recent updates adding Docker images, transcript exporters, and TensorRT-LLM support. Future plans include a dedicated server codebase and more in-depth documentation.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Initial runs may exhibit slower inference due to JIT tracing of the VAD model. Some advanced features like word alignment and dynamic time length support are specific to the CTranslate2 backend. Benchmarks were conducted with without_timestamps=True, which may affect Word Error Rate (WER).

Health Check

Last commit: 11 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 47 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

RealtimeSTT by KoljaB

Top 0.9% · 8k stars
Speech-to-text library for realtime applications
created 1 year ago · updated 3 weeks ago
Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

whisper by openai

Top 0.4% · 86k stars
Speech recognition model for multilingual transcription/translation
created 2 years ago · updated 1 month ago