Live transcription/translation tool with OSC and WebSocket support
Whispering Tiger is an open-source, locally-run tool for real-time speech-to-text transcription and translation, as well as optical character recognition (OCR) and text-to-speech (TTS). It targets streamers, VRChat users, and developers needing live audio/visual processing, offering integration via WebSockets and OSC for overlays and in-app use.
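For example, an overlay or companion app can consume the live output over the WebSocket connection. The sketch below is a minimal listener, assuming the server runs on ws://127.0.0.1:5000 and emits JSON messages containing a "text" field; the actual address, port, and payload format depend on your configuration and are not guaranteed here.

```python
# Minimal overlay-style WebSocket client (assumed endpoint and message shape --
# verify both against your Whispering Tiger settings).
import asyncio
import json

import websockets


async def listen(url: str = "ws://127.0.0.1:5000") -> None:
    async with websockets.connect(url) as ws:
        async for raw in ws:
            try:
                message = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip non-JSON frames
            text = message.get("text")
            if text:
                print(f"[transcript] {text}")


if __name__ == "__main__":
    asyncio.run(listen())
```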
How It Works
The project leverages multiple state-of-the-art AI models for its core functionalities. For speech processing, it supports OpenAI's Whisper, Meta's Seamless M4T, Microsoft's SpeechT5, and NVIDIA's NeMo Canary, enabling transcription and translation across numerous languages. OCR is handled by EasyOCR and Microsoft's Phi-4 Multimodal LLM, capturing text from screen images. TTS capabilities are provided by Silero, F5/E2-TTS, Kokoro TTS, and Zonos TTS, with voice-cloning support. The architecture is designed for local execution, which keeps latency low and keeps audio on the user's machine once the models are downloaded.
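As a rough illustration of what local Whisper inference looks like (this is generic openai-whisper usage, not Whispering Tiger's own code), the model weights are fetched once and cached, after which transcription and translation run entirely offline; the file name used below is just a placeholder.

```python
# Generic local Whisper pipeline: download once, then run offline.
import whisper

model = whisper.load_model("small")  # weights are downloaded and cached on first use
result = model.transcribe("speech.wav", task="translate")  # task: "transcribe" or "translate"
print(result["text"])
```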
Quick Start & Requirements
Run the provided `.bat` files (e.g., `start-transcribe-mic.bat`) and configure parameters via a text editor or command-line flags. A native UI application is available at https://github.com/Sharrnah/whispering-ui for easier management.
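For VRChat users, the OSC side of the integration can be exercised with a few lines of python-osc. The sketch below pushes a line of text to VRChat's chatbox; the `/chatbox/input` address and the [text, send-immediately] argument list follow VRChat's OSC documentation, and port 9000 is VRChat's default OSC input port, so verify both against your setup.

```python
# Sketch of the kind of OSC message used to place text in VRChat's chatbox
# (address, arguments, and port are based on VRChat's OSC docs -- verify locally).
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)
client.send_message("/chatbox/input", ["Hello from Whispering Tiger", True])
```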
Highlighted Details
Maintenance & Community
The project acknowledges contributions from OpenAI, Meta, Microsoft, and others. Community links are not explicitly provided in the README.
Licensing & Compatibility
The project's licensing is not explicitly stated in the provided README text.
Limitations & Caveats
Initial model downloads can be substantial (up to 20 GB). The README mentions a 2 GB limit on GitHub releases, necessitating downloads from external links. Some LLM integrations are noted as proof-of-concept.