Live transcription/translation tool with OSC and WebSocket support
Whispering Tiger is an open-source, locally-run tool for real-time speech-to-text transcription and translation, as well as optical character recognition (OCR) and text-to-speech (TTS). It targets streamers, VRChat users, and developers needing live audio/visual processing, offering integration via WebSockets and OSC for overlays and in-app use.
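For example, an overlay or companion app can consume the live output over the WebSocket connection. The sketch below is a minimal listener, assuming the server runs on ws://127.0.0.1:5000 and emits JSON messages containing a "text" field; the actual address, port, and payload format depend on your configuration and are not guaranteed here.

```python
# Minimal overlay-style WebSocket client (assumed endpoint and message shape --
# verify both against your Whispering Tiger settings).
import asyncio
import json

import websockets


async def listen(url: str = "ws://127.0.0.1:5000") -> None:
    async with websockets.connect(url) as ws:
        async for raw in ws:
            try:
                message = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip non-JSON frames
            text = message.get("text")
            if text:
                print(f"[transcript] {text}")


if __name__ == "__main__":
    asyncio.run(listen())
```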
How It Works
The project leverages multiple state-of-the-art AI models for its core functionalities. For speech processing, it supports OpenAI's Whisper, Meta's Seamless M4T, Microsoft's SpeechT5, and NVIDIA's NeMo Canary, enabling transcription and translation across numerous languages. OCR is handled by EasyOCR and Microsoft's Phi-4 Multimodal LLM, capturing text from screen images. TTS capabilities are provided by Silero, F5/E2-TTS, Kokoro TTS, and Zonos TTS, with voice-cloning support. The architecture is designed for local execution, which keeps latency low and keeps audio on the user's machine once the models are downloaded.
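As a rough illustration of what local Whisper inference looks like (this is generic openai-whisper usage, not Whispering Tiger's own code), the model weights are fetched once and cached, after which transcription and translation run entirely offline; the file name used below is just a placeholder.

```python
# Generic local Whisper pipeline: download once, then run offline.
import whisper

model = whisper.load_model("small")  # weights are downloaded and cached on first use
result = model.transcribe("speech.wav", task="translate")  # task: "transcribe" or "translate"
print(result["text"])
```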
Quick Start & Requirements
Run the provided `.bat` files (e.g., `start-transcribe-mic.bat`) and configure parameters via a text editor or command-line flags. A native UI application is available at https://github.com/Sharrnah/whispering-ui for easier management.
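For VRChat users, the OSC side of the integration can be exercised with a few lines of python-osc. The sketch below pushes a line of text to VRChat's chatbox; the `/chatbox/input` address and the [text, send-immediately] argument list follow VRChat's OSC documentation, and port 9000 is VRChat's default OSC input port, so verify both against your setup.

```python
# Sketch of the kind of OSC message used to place text in VRChat's chatbox
# (address, arguments, and port are based on VRChat's OSC docs -- verify locally).
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)
client.send_message("/chatbox/input", ["Hello from Whispering Tiger", True])
```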
Highlighted Details
Maintenance & Community
The project acknowledges contributions from OpenAI, Meta, Microsoft, and others. Community links are not explicitly provided in the README.
Licensing & Compatibility
The project's licensing is not explicitly stated in the provided README text.
Limitations & Caveats
Initial model downloads can be substantial (up to 20 GB). The README mentions a 2 GB limit on GitHub releases, necessitating downloads from external links. Some LLM integrations are noted as proof-of-concept.