TheWhisper by TheStageAI

Optimized speech-to-text inference for streaming and on-device use

Created 2 months ago
781 stars

Top 45.0% on SourcePulse

View on GitHub
Project Summary

This repository provides optimized Whisper models for efficient speech-to-text inference, focusing on streaming and on-device deployment. It targets developers needing self-hosting, cloud hosting, or edge solutions for real-time captioning and voice interfaces, offering low-latency, low-power, and scalable transcription. The project delivers high performance on NVIDIA GPUs and Apple Silicon through specialized inference engines.

How It Works

The project offers fine-tuned Whisper models supporting flexible chunk sizes (10s, 15s, 20s, 30s), overcoming the original models' fixed 30s limit. It leverages high-performance inference engines for NVIDIA GPUs (TheStage AI ElasticModels), claiming up to 220 tokens/sec on L40s for whisper-large-v3. For macOS and Apple Silicon, it provides CoreML engines optimized for minimal power consumption (~2W) and RAM usage (~2GB). Streaming inference is implemented for both NVIDIA and macOS platforms, enabling real-time transcription capabilities.
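The repository's actual API is not reproduced here, but the flexible-chunking idea can be illustrated with a small sketch. All names below are hypothetical; only the supported chunk sizes (10 s, 15 s, 20 s, 30 s) and Whisper's 16 kHz sample rate come from the text above.

```python
# Illustrative sketch (not the repository's API): splitting a mono
# audio buffer into fixed-size windows for sequential transcription.
# Chunk sizes mirror those the project supports.

SAMPLE_RATE = 16_000  # Whisper models operate on 16 kHz audio
SUPPORTED_CHUNK_SECONDS = (10, 15, 20, 30)

def split_into_chunks(audio, chunk_seconds=15):
    """Yield consecutive windows of chunk_seconds of samples.

    The final chunk may be shorter; a real engine would pad it
    to the model's expected input length.
    """
    if chunk_seconds not in SUPPORTED_CHUNK_SECONDS:
        raise ValueError(
            f"chunk_seconds must be one of {SUPPORTED_CHUNK_SECONDS}"
        )
    step = chunk_seconds * SAMPLE_RATE
    for start in range(0, len(audio), step):
        yield audio[start:start + step]

# 65 seconds of audio split into 15 s windows -> four full chunks
# plus one 5 s remainder.
audio = [0.0] * (65 * SAMPLE_RATE)
chunks = list(split_into_chunks(audio, chunk_seconds=15))
```

A streaming engine would feed such windows to the model as they arrive rather than waiting for a full 30 s buffer, which is what removes the original models' fixed-length constraint.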

Quick Start & Requirements

Clone the repository and cd TheWhisper. Install the platform-specific package: pip install .[apple] or pip install .[nvidia]. For TheStage AI optimized NVIDIA engines, additionally install thestage-elastic-models[nvidia]; this requires pip install thestage followed by thestage config set --api-token <YOUR_API_TOKEN>. flash_attn==2.8.2 is a required dependency on NVIDIA.

  • NVIDIA Prerequisites: Ubuntu 20.04+, Python 3.10-3.12, CUDA 11.8+, Driver 520.0+, 2.5 GB RAM (5 GB recommended). Supported GPUs include RTX 4090, L40s.
  • Apple Silicon Prerequisites: macOS 15.0+ / iOS 18.0+, Python 3.10-3.12, M1/M2/M3/M4 series chips, 2 GB RAM (4 GB recommended).
  • Links: Electron demo asset: https://github.com/user-attachments/assets/f4d3fe7b-e2c5-42ff-a5d0-fef6afd11684. Placeholder for React frontend example.
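Collected from the steps above, a typical install sequence might look like the following. The repository URL is inferred from the project and organization names, and the API token is a placeholder; consult the README before running.

```shell
# Clone and enter the repository (URL inferred, verify against the README)
git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper

# Pick ONE platform extra:
pip install ".[apple]"     # macOS / Apple Silicon
pip install ".[nvidia]"    # NVIDIA GPUs (pulls in flash_attn==2.8.2)

# Optional: TheStage AI optimized NVIDIA engines
pip install thestage
pip install "thestage-elastic-models[nvidia]"
thestage config set --api-token <YOUR_API_TOKEN>
```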

Highlighted Details

  • High-performance TheStage AI inference engines for NVIDIA GPUs achieve up to 220 tok/s on L40s for whisper-large-v3.
  • CoreML engines for macOS/Apple Silicon offer industry-leading low power consumption (~2W) and RAM usage (~2GB).
  • Supports flexible chunk sizes (10s, 15s, 20s, 30s) for transcription, unlike original Whisper models.
  • Streaming inference is supported for both NVIDIA and macOS platforms.

Maintenance & Community

The README does not provide links to community channels (e.g., Discord, Slack) or a public roadmap. Acknowledgements are made to Silero VAD, OpenAI Whisper, Hugging Face Transformers, and the MLX community. No specific contributors, sponsorships, or partnerships are highlighted.

Licensing & Compatibility

The PyTorch HF Transformers (NVIDIA) and CoreML (macOS) engines are provided free of charge. TheStage AI optimized NVIDIA engines are free for small organizations (≤ 4 GPUs/year). Commercial use of these optimized engines in larger deployments requires contacting TheStage AI for a service request and explicit licensing.

Limitations & Caveats

Streaming inference is reportedly not supported for whisper-large-v3-turbo on NVIDIA platforms. Word timestamp generation is unavailable for whisper-large-v3 on NVIDIA. The provided link for the React frontend example is a placeholder, and a direct download link for "TheNotes for macOS" is not present. The optimized NVIDIA engines require API token configuration and may necessitate a commercial license for extensive use.

Health Check
Last Commit

19 hours ago

Responsiveness

Inactive

Pull Requests (30d)
13
Issues (30d)
1
Star History
44 stars in the last 30 days
