TheWhisper by TheStageAI

Optimized speech-to-text inference for streaming and on-device use

Created 4 months ago
818 stars

Top 43.2% on SourcePulse

Project Summary

This repository provides optimized Whisper models for efficient speech-to-text inference, focusing on streaming and on-device deployment. It targets developers needing self-hosting, cloud hosting, or edge solutions for real-time captioning and voice interfaces, offering low-latency, low-power, and scalable transcription. The project delivers high performance on NVIDIA GPUs and Apple Silicon through specialized inference engines.

How It Works

The project offers fine-tuned Whisper models supporting flexible chunk sizes (10s, 15s, 20s, 30s), overcoming the original models' fixed 30s limit. It leverages high-performance inference engines for NVIDIA GPUs (TheStage AI ElasticModels), claiming up to 220 tokens/sec on L40s for whisper-large-v3. For macOS and Apple Silicon, it provides CoreML engines optimized for minimal power consumption (~2W) and RAM usage (~2GB). Streaming inference is implemented for both NVIDIA and macOS platforms, enabling real-time transcription capabilities.
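The flexible chunking described above can be illustrated with a short sketch. This is not TheWhisper's actual API (the function and constant names here are illustrative); it only shows how an audio stream of arbitrary length would be split into the supported fixed-size windows before transcription.

```python
# Illustrative sketch of fixed-size chunking for transcription.
# Whisper-family models consume 16 kHz mono audio; TheWhisper's
# fine-tuned models accept 10s/15s/20s/30s windows rather than
# only the original fixed 30s window.

SAMPLE_RATE = 16_000  # samples per second, the Whisper input rate

def chunk_spans(num_samples: int, chunk_seconds: int = 15):
    """Yield (start, end) sample indices for fixed-size chunks.

    The final chunk may be shorter than chunk_seconds; in practice
    it would be padded before being fed to the model.
    """
    if chunk_seconds not in (10, 15, 20, 30):
        raise ValueError("supported chunk sizes are 10s, 15s, 20s, 30s")
    step = chunk_seconds * SAMPLE_RATE
    for start in range(0, num_samples, step):
        yield start, min(start + step, num_samples)

# 50 seconds of audio with 15s chunks -> three full chunks plus one partial
spans = list(chunk_spans(50 * SAMPLE_RATE, chunk_seconds=15))
```

Smaller chunks reduce the latency before the first partial transcript appears, which is why variable chunk sizes matter for the streaming use case.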

Quick Start & Requirements

Clone the repository and cd TheWhisper. Install the platform-specific package: pip install .[apple] or pip install .[nvidia]. For TheStage AI optimized NVIDIA engines, additionally install thestage-elastic-models[nvidia] (this requires pip install thestage and thestage config set --api-token <YOUR_API_TOKEN>). flash_attn==2.8.2 is a required dependency on NVIDIA.

  • NVIDIA Prerequisites: Ubuntu 20.04+, Python 3.10-3.12, CUDA 11.8+, Driver 520.0+, 2.5 GB RAM (5 GB recommended). Supported GPUs include RTX 4090, L40s.
  • Apple Silicon Prerequisites: macOS 15.0+ / iOS 18.0+, Python 3.10-3.12, M1/M2/M3/M4 series chips, 2 GB RAM (4 GB recommended).
  • Links: Electron demo asset: https://github.com/user-attachments/assets/f4d3fe7b-e2c5-42ff-a5d0-fef6afd11684. Placeholder for React frontend example.
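The installation steps above can be consolidated into a single setup script. The clone URL below is inferred from the project and organization names and may differ from the actual repository address; pick only the extras block that matches your platform.

```shell
# Clone and enter the repository (URL assumed from project/org names)
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

# Apple Silicon / macOS install
pip install ".[apple]"

# NVIDIA install (pulls flash_attn==2.8.2 as a required dependency)
pip install ".[nvidia]"

# Optional: TheStage AI optimized NVIDIA engines
pip install thestage "thestage-elastic-models[nvidia]"
thestage config set --api-token <YOUR_API_TOKEN>
```

Quoting the extras (".[apple]") avoids shell glob expansion of the square brackets, which otherwise breaks the command in zsh.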

Highlighted Details

  • High-performance TheStage AI inference engines for NVIDIA GPUs achieve up to 220 tok/s on L40s for whisper-large-v3.
  • CoreML engines for macOS/Apple Silicon offer industry-leading low power consumption (~2W) and RAM usage (~2GB).
  • Supports flexible chunk sizes (10s, 15s, 20s, 30s) for transcription, unlike original Whisper models.
  • Streaming inference is supported for both NVIDIA and macOS platforms.

Maintenance & Community

The README does not provide links to community channels (e.g., Discord, Slack) or a public roadmap. Acknowledgements are made to Silero VAD, OpenAI Whisper, Hugging Face Transformers, and the MLX community. No specific contributors, sponsorships, or partnerships are highlighted.

Licensing & Compatibility

The PyTorch HF Transformers (NVIDIA) and CoreML (macOS) engines are provided free of charge. TheStage AI optimized NVIDIA engines are free for small organizations (≤ 4 GPUs/year). Commercial use of these optimized engines in larger deployments requires contacting TheStage AI for a service request and explicit licensing.

Limitations & Caveats

Streaming inference is reportedly not supported for whisper-large-v3-turbo on NVIDIA platforms. Word timestamp generation is unavailable for whisper-large-v3 on NVIDIA. The provided link for the React frontend example is a placeholder, and a direct download link for "TheNotes for macOS" is not present. The optimized NVIDIA engines require API token configuration and may necessitate a commercial license for extensive use.

Health Check
Last Commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
10
Issues (30d)
0
Star History
11 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.

voxtral.c by antirez

5.3%
1k
Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
Created 2 weeks ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

moonshine by moonshine-ai

9.0%
4k
Speech-to-text models optimized for fast, accurate ASR on edge devices
Created 1 year ago
Updated 2 days ago