RealtimeSTT_LLM_TTS by Ikaros-521

Realtime STT/TTS pipeline for cross-network, real-time conversations

created 1 year ago
406 stars

Top 72.7% on sourcepulse

Project Summary

This project provides a real-time speech-to-text (STT) system designed for voice assistants and applications requiring fast, low-latency transcription. It integrates with LLM services like OpenAI and ZhipuAI, and TTS engines such as GPT-SOVITS and Edge-TTS, enabling cross-network real-time conversational experiences via a web interface.

How It Works

The system utilizes a multi-component architecture for robust voice processing. Voice Activity Detection (VAD) is handled by WebRTCVAD for initial detection and SileroVAD for verification. Speech-to-text transcription is powered by Faster-Whisper, optimized for GPU acceleration. Wake word detection is implemented using Porcupine. The project also supports streaming LLM and TTS integrations for conversational AI.
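The two-stage voice activity check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Detector` type and the `two_stage_vad` function are hypothetical names, and the two detectors are passed in as plain callables standing in for WebRTCVAD (the cheap first pass) and SileroVAD (the verification pass).

```python
from typing import Callable, Iterable, List

# Hypothetical detector type: takes one audio frame (raw bytes) and
# returns True if it believes the frame contains speech.
Detector = Callable[[bytes], bool]


def two_stage_vad(
    frames: Iterable[bytes],
    fast_check: Detector,
    verify: Detector,
) -> List[bool]:
    """Flag a frame as speech only if the cheap detector fires AND the
    more expensive detector confirms it."""
    decisions = []
    for frame in frames:
        # `and` short-circuits, so `verify` runs only on frames that the
        # fast detector has already flagged.
        decisions.append(fast_check(frame) and verify(frame))
    return decisions
```

The point of the ordering is cost: the verification detector is only invoked on frames the fast detector has already flagged, so silence is rejected cheaply while false positives from the first pass are filtered out by the second.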

Quick Start & Requirements

  • Installation: pip install RealtimeSTT
  • GPU Support (Recommended): Requires NVIDIA CUDA Toolkit 11.8, cuDNN 8.7.0 for CUDA 11.x, and PyTorch with CUDA support (pip install torch==2.0.1+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118).
  • Other Dependencies: ffmpeg (installable via package managers or direct download).
  • WebUI: Run python webui.py.
  • Server: Run python RealtimeSTT_server2.py and access via index.html.
  • Documentation: README
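Before launching the web UI or server, it can help to confirm that the prerequisites listed above are visible from Python. The following is a small sketch under stated assumptions: `check_environment` is a hypothetical helper, not part of the project, and it only checks that `ffmpeg` is on the PATH, that the `RealtimeSTT` and `torch` packages are importable, and whether PyTorch reports a usable CUDA device.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report which Quick Start prerequisites are visible from Python."""
    report = {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "realtimestt_installed": importlib.util.find_spec("RealtimeSTT") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as "no CUDA".
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

If `cuda_available` comes back as "no" on a machine with an NVIDIA GPU, the usual suspects are a CPU-only PyTorch build or a CUDA/cuDNN version mismatch.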

Highlighted Details

  • Supports real-time transcription with configurable models (tiny to large-v2).
  • Features wake word activation (e.g., "jarvis") for triggering recordings.
  • Integrates with OpenAI and ZhipuAI (streaming LLM), and with GPT-SOVITS and Edge-TTS for speech output.
  • Offers a web UI for configuration and cross-network service calls.
  • Includes callbacks for various events like recording start/stop and transcription updates.
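The wake word, model size, and callback options above might be wired together along these lines. This sketch assumes the `AudioToTextRecorder` class and keyword arguments from the upstream RealtimeSTT package (`model`, `wake_words`, `enable_realtime_transcription`, `on_recording_start`, `on_recording_stop`, `on_realtime_transcription_update`); this fork may differ, so check its README. `build_recorder_kwargs` and `main` are hypothetical helper names introduced for illustration.

```python
from typing import Any, Dict, List


def build_recorder_kwargs(
    events: List[str],
    model: str = "tiny",
    wake_word: str = "jarvis",
) -> Dict[str, Any]:
    """Collect recorder options; each callback appends to `events`."""
    return {
        "model": model,                         # "tiny" up to "large-v2"
        "wake_words": wake_word,                # recording starts after this word
        "enable_realtime_transcription": True,  # emit partial transcripts
        "on_recording_start": lambda: events.append("recording_start"),
        "on_recording_stop": lambda: events.append("recording_stop"),
        "on_realtime_transcription_update": (
            lambda text: events.append(f"partial: {text}")
        ),
    }


def main() -> None:
    # Imported here so the helper above stays usable without the package
    # installed or a microphone attached.
    from RealtimeSTT import AudioToTextRecorder

    events: List[str] = []
    recorder = AudioToTextRecorder(**build_recorder_kwargs(events))
    print('Say "jarvis", then speak.')
    print("Transcript:", recorder.text())  # blocks until one utterance is done
    print("Events:", events)
```

Calling `main()` requires a working microphone and the installed package. Keeping the options in a plain dictionary makes it easy to log or adjust them before the recorder is created.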

Maintenance & Community

  • Recent updates include bug fixes for the web UI, custom OpenAI model configuration, and wake word activation.
  • The project is open for contributions.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The MIT license permits commercial use and integration into closed-source projects.

Limitations & Caveats

  • GPU acceleration is strongly recommended for optimal performance, especially with real-time transcription.
  • Some demo scripts require API keys to be set as environment variables (e.g., OPENAI_API_KEY).
  • The provided web UI is noted as "not complete, but usable."
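
Since some demo scripts expect API keys in environment variables, a script can fail early with a clear message when a key is missing. The helper below is a hypothetical sketch (`require_env` is not part of the project); `OPENAI_API_KEY` is the variable name given above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running the demo."
        )
    return value
```

A demo script would call `require_env("OPENAI_API_KEY")` once at startup, so a missing key surfaces before any audio is recorded.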

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 37 stars in the last 90 days
