Fast audio transcription API
Top 90.6% on sourcepulse
This project provides a highly optimized API for audio transcription using OpenAI's Whisper Large v3 model, targeting developers and businesses needing fast, scalable, and deployable speech-to-text solutions. It offers features like speaker diarization, asynchronous task management, and robust concurrency, significantly reducing transcription times.
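As a rough illustration of how such an API is typically called from a client, the sketch below builds a JSON transcription request with the standard library. The endpoint path, port, auth header name, and payload fields are assumptions for illustration, not documented by this project; only the ADMIN_KEY secret itself comes from the README.

```python
import json
import urllib.request

# Hypothetical endpoint and auth header -- the real route and header name
# depend on the deployed app; adjust to match your deployment.
API_URL = "http://localhost:8000/"
ADMIN_KEY = "my-admin-key"

def build_transcription_request(audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a transcription task."""
    payload = json.dumps({"url": audio_url, "diarize": True}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-admin-api-key": ADMIN_KEY,  # assumed header name
        },
    )

req = build_transcription_request("https://example.com/audio.mp3")
print(req.method, req.get_header("Content-type"))
```

Sending the request (urllib.request.urlopen(req)) would then return the task result or, for asynchronous task management, a task identifier to poll.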
How It Works
The API leverages Hugging Face Transformers, Optimum, and flash-attn for accelerated inference. It employs fp16 precision, batching (up to 24 concurrent requests), and Flash Attention 2 for significant speedups. Speaker diarization is integrated via pyannote models, requiring Hugging Face authentication. The architecture is designed for high concurrency and parallel processing, making it suitable for production workloads.
Quick Start & Requirements
Pull the prebuilt Docker image yoeven/insanely-fast-whisper-api:latest or build from source. The README provides detailed instructions for Fly.io deployment (fly launch, then fly secrets set ADMIN_KEY=<your_token> HF_TOKEN=<your_hf_key>), for installing flash-attn (which has specific build requirements), and for running the server locally with uvicorn app.app:app.
Maintenance & Community
The project is part of JigsawStack, which offers managed APIs. The core code is based on the Insanely Fast Whisper CLI project by Vaibhav Srivastav. Community links are not explicitly provided in the README.
Licensing & Compatibility
The project is open source and deployable on any GPU cloud provider supporting Docker. Specific licensing details (e.g., MIT, Apache) are not stated in the README, so verify the license terms before relying on it for commercial use or closed-source integration.
Limitations & Caveats
The large Docker image size can lead to long initial deployment times. Speaker diarization requires accepting user conditions and providing a Hugging Face token. Fly.io machines may take up to 15 minutes to auto-shut down after idling, incurring costs if not manually stopped.