TheWhisper by TheStageAI

Optimized speech-to-text inference for streaming and on-device use

Created 2 weeks ago

332 stars

Top 82.4% on SourcePulse

View on GitHub

Project Summary

This repository provides optimized Whisper models for efficient speech-to-text inference, with a focus on streaming and on-device deployment. It targets developers who need self-hosted, cloud-hosted, or edge solutions for real-time captioning and voice interfaces, aiming for low-latency, low-power, and scalable transcription. The project delivers high performance on NVIDIA GPUs and Apple Silicon through specialized inference engines.

How It Works

The project offers fine-tuned Whisper models that support flexible chunk sizes (10 s, 15 s, 20 s, 30 s), lifting the original models' fixed 30 s window. On NVIDIA GPUs it uses high-performance inference engines (TheStage AI ElasticModels), with a claimed throughput of up to 220 tokens/sec on an L40S for whisper-large-v3. For macOS and Apple Silicon, it provides CoreML engines optimized for minimal power consumption (~2 W) and RAM usage (~2 GB). Streaming inference is implemented for both NVIDIA and macOS platforms, enabling real-time transcription.
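As a rough illustration of how a non-default chunk size could be used, the sketch below drives a Whisper checkpoint through the standard Hugging Face Transformers ASR pipeline. The checkpoint id, file name, and chunk_length_s value are placeholders, and the repository's own engines may expose a different interface.

```python
# Minimal sketch (not the repository's API): transcribe a file with a 15 s
# window using the generic Hugging Face Transformers ASR pipeline.
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",   # placeholder; swap in the fine-tuned checkpoint you use
    torch_dtype=torch.float16 if device.startswith("cuda") else torch.float32,
    device=device,
)

# chunk_length_s sets the audio window fed to the model; the fine-tuned
# TheWhisper models are described as accepting 10/15/20/30 s windows.
result = asr("meeting.wav", chunk_length_s=15, return_timestamps=True)
print(result["text"])
```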

Quick Start & Requirements

Clone the repository and cd TheWhisper. Install the platform-specific extra: pip install .[apple] or pip install .[nvidia]. For the TheStage AI optimized NVIDIA engines, additionally install thestage-elastic-models[nvidia], which requires pip install thestage and thestage config set --api-token <YOUR_API_TOKEN>. flash_attn==2.8.2 is a required dependency on NVIDIA. A generic environment check is sketched after the prerequisites below.

  • NVIDIA Prerequisites: Ubuntu 20.04+, Python 3.10-3.12, CUDA 11.8+, Driver 520.0+, 2.5 GB RAM (5 GB recommended). Supported GPUs include RTX 4090, L40s.
  • Apple Silicon Prerequisites: macOS 15.0+ / iOS 18.0+, Python 3.10-3.12, M1/M2/M3/M4 series chips, 2 GB RAM (4 GB recommended).
  • Links: Electron demo asset: https://github.com/user-attachments/assets/f4d3fe7b-e2c5-42ff-a5d0-fef6afd11684. The link for the React frontend example is currently a placeholder.
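Before choosing between the [nvidia] and [apple] extras, a generic PyTorch check (not something the repository requires) can confirm which accelerator is visible:

```python
# Generic environment check before installing the [nvidia] or [apple] extra.
import torch

if torch.cuda.is_available():
    print("CUDA GPU detected:", torch.cuda.get_device_name(0), "-> pip install .[nvidia]")
elif torch.backends.mps.is_available():
    print("Apple Silicon (MPS) backend detected -> pip install .[apple]")
else:
    print("No supported accelerator found; check the prerequisites above.")
```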

Highlighted Details

  • High-performance TheStage AI inference engines for NVIDIA GPUs achieve a claimed 220 tok/s on an L40S for whisper-large-v3.
  • CoreML engines for macOS/Apple Silicon offer industry-leading low power consumption (~2 W) and RAM usage (~2 GB).
  • Supports flexible chunk sizes (10 s, 15 s, 20 s, 30 s) for transcription, unlike the original Whisper models, which are fixed at 30 s.
  • Streaming inference is supported on both NVIDIA and macOS platforms (see the illustrative sketch below).
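The repository's streaming engines are not documented in this summary, but the general pattern — capture short audio windows and transcribe them as they arrive — can be sketched with generic tools. The sounddevice capture, window length, and checkpoint below are all assumptions, not the project's actual streaming API.

```python
# Illustrative pseudo-streaming loop, NOT the repository's streaming engine:
# record fixed-length windows from the microphone and transcribe each one.
import numpy as np
import sounddevice as sd
from transformers import pipeline

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
WINDOW_SECONDS = 10       # matches the smallest chunk size listed above

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-large-v3")   # placeholder checkpoint

while True:
    audio = sd.rec(int(WINDOW_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()                                     # block until the window is captured
    text = asr({"raw": np.squeeze(audio), "sampling_rate": SAMPLE_RATE})["text"]
    print(text, flush=True)
```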

Maintenance & Community

The README does not provide links to community channels (e.g., Discord, Slack) or a public roadmap. Acknowledgements are made to Silero VAD, OpenAI Whisper, Hugging Face Transformers, and the MLX community. No specific contributors, sponsorships, or partnerships are highlighted.

Licensing & Compatibility

The PyTorch HF Transformers (NVIDIA) and CoreML (macOS) engines are provided free of charge. The TheStage AI optimized NVIDIA engines are free for small organizations (≤ 4 GPUs/year); commercial use in larger deployments requires contacting TheStage AI for a service request and explicit licensing.

Limitations & Caveats

Streaming inference is reportedly not supported for whisper-large-v3-turbo on NVIDIA platforms. Word timestamp generation is unavailable for whisper-large-v3 on NVIDIA. The provided link for the React frontend example is a placeholder, and a direct download link for "TheNotes for macOS" is not present. The optimized NVIDIA engines require API token configuration and may necessitate a commercial license for extensive use.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 36
  • Issues (30d): 1
  • Star History: 398 stars in the last 15 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Travis Fischer (founder of Agentic).

RealtimeSTT by KoljaB

  • Speech-to-text library for realtime applications
  • 0.4% · 9k stars
  • Created 2 years ago · Updated 3 months ago