WhisperS2T by shashikg

Optimized speech-to-text pipeline for Whisper models

created 1 year ago
445 stars

Top 68.5% on sourcepulse

Project Summary

WhisperS2T is an optimized speech-to-text pipeline designed to accelerate Whisper model inference. It targets researchers and developers needing faster, more accurate transcriptions, offering significant speedups over existing implementations.

How It Works

WhisperS2T achieves its speed by supporting multiple inference backends, including CTranslate2 and TensorRT-LLM, and by implementing pipeline-level optimizations. These include intelligent batching of audio segments, asynchronous loading of large files, and heuristics to reduce hallucinations. The design prioritizes efficient data flow and processing, leading to notable performance gains without sacrificing accuracy.
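The batching idea above can be sketched in a few lines. The snippet below is an illustrative sketch, not WhisperS2T's actual implementation, and the segment durations are made up: sorting VAD-detected segments by duration before grouping them means each batch only pads to its own longest member, which is the core of why intelligent batching saves compute.

```python
# Illustrative sketch of length-sorted batching (not WhisperS2T's actual code).
# Grouping segments of similar duration minimizes padding inside each batch.

def make_batches(durations, batch_size):
    """Sort segment durations, group into batches, and return
    (original_index, duration) pairs per batch."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batches.append([(i, durations[i]) for i in idx])
    return batches

# Hypothetical VAD segment durations in seconds.
segments = [12.0, 1.5, 29.9, 3.2, 8.7, 2.1]
for batch in make_batches(segments, batch_size=3):
    # Padding per batch is bounded by the longest segment in that batch.
    print([d for _, d in batch], "padded to", max(d for _, d in batch))
```

With naive in-order batching, the first batch here would pad a 1.5 s segment up to 29.9 s; after sorting, short segments are padded only to 3.2 s.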

Quick Start & Requirements

  • Install: pip install -U whisper-s2t or pip install -U git+https://github.com/shashikg/WhisperS2T.git
  • Prerequisites: libsndfile1, ffmpeg. The TensorRT-LLM backend additionally requires installing TensorRT and TensorRT-LLM (via install_tensorrt.sh or the official instructions). A CUDA-capable GPU is required for GPU acceleration.
  • Docker: Prebuilt images available: docker pull shashikg/whisper_s2t:dev-trtllm. Build from source with docker build.
  • Docs: Google Colab notebooks are provided for quickstart.
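Collecting the steps above, a typical setup on a Debian/Ubuntu machine might look like the following (package names and commands are taken from the prerequisites listed above; adjust for your distribution):

```shell
# System prerequisites (Debian/Ubuntu; other distros use their own package manager).
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg

# Install the release from PyPI...
pip install -U whisper-s2t

# ...or the latest development version from GitHub:
# pip install -U git+https://github.com/shashikg/WhisperS2T.git

# Alternatively, pull the prebuilt Docker image with the TensorRT-LLM backend:
# docker pull shashikg/whisper_s2t:dev-trtllm
```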

Highlighted Details

  • Claims 2.3X speedup over WhisperX and 3X over HuggingFace Pipeline with FlashAttention 2.
  • Supports multiple backends: Original OpenAI, HuggingFace (with FlashAttention2), CTranslate2, and TensorRT-LLM.
  • Integrates custom Voice Activity Detection (VAD) models.
  • Offers batching for multiple languages/tasks and experimental dynamic time length support (CTranslate2 backend).
  • Includes heuristics to reduce text hallucinations (some specific to CTranslate2).
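The project does not detail its hallucination heuristics here, but a common symptom such heuristics target in Whisper pipelines is decoder looping, where a short phrase repeats many times. The sketch below is a hypothetical repetition check for illustration only, not WhisperS2T's actual heuristic:

```python
# Hypothetical repetition filter for transcript segments (illustrative only;
# not WhisperS2T's actual hallucination heuristic).

def looks_hallucinated(text, max_repeat=3):
    """Flag text where some word n-gram repeats back-to-back more than
    max_repeat times, a common symptom of decoder looping."""
    words = text.lower().split()
    for n in range(1, 5):                       # check 1- to 4-gram loops
        for start in range(len(words) - n):
            gram = words[start:start + n]
            repeats = 1
            pos = start + n
            while words[pos:pos + n] == gram:   # count consecutive repeats
                repeats += 1
                pos += n
            if repeats > max_repeat:
                return True
    return False

print(looks_hallucinated("thank you thank you thank you thank you"))  # True
print(looks_hallucinated("the meeting starts at nine tomorrow"))      # False
```

A flagged segment could then be dropped or re-decoded with different parameters; production heuristics are typically more nuanced than this.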

Maintenance & Community

The project is actively developed, with recent updates adding Docker images, transcript exporters, and TensorRT-LLM support. Future plans include a dedicated server codebase and more in-depth documentation.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Initial runs may exhibit slower inference due to JIT tracing of the VAD model. Some advanced features like word alignment and dynamic time length support are specific to the CTranslate2 backend. Benchmarks were conducted with without_timestamps=True, which may affect Word Error Rate (WER).

Health Check

Last commit: 11 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 47 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

RealtimeSTT by KoljaB

Top 0.9% · 8k stars
Speech-to-text library for realtime applications
created 1 year ago · updated 3 weeks ago
Starred by Boris Cherny (creator of Claude Code; MTS at Anthropic), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

whisper by openai

Top 0.4% · 86k stars
Speech recognition model for multilingual transcription/translation
created 2 years ago · updated 1 month ago