RealtimeSTT_LLM_TTS by Ikaros-521

Realtime STT/TTS pipeline for cross-network, real-time conversations

Created 1 year ago

430 stars

Top 69.0% on SourcePulse

Project Summary

This project provides a real-time speech-to-text (STT) system for voice assistants and other applications that need low-latency transcription. It integrates with LLM services such as OpenAI and ZhipuAI and with TTS engines such as GPT-SOVITS and Edge-TTS, enabling cross-network, real-time conversations through a web interface.

How It Works

The system chains several components. Voice Activity Detection (VAD) uses WebRTCVAD for initial detection and SileroVAD for verification. Speech-to-text transcription is handled by Faster-Whisper, which is optimized for GPU acceleration. Wake word detection uses Porcupine. The project also supports streaming LLM and TTS integrations for conversational AI.
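The two-stage VAD idea described above (a cheap first pass, then a stricter check on whatever it lets through) can be sketched in plain Python. This is only an illustration under stated assumptions: `fast_gate` and `strict_check` are hypothetical energy-based stand-ins, not the WebRTCVAD or SileroVAD APIs, and the thresholds are arbitrary.

```python
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]  # one short chunk of audio samples in [-1.0, 1.0]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def fast_gate(frame: Frame, threshold: float = 0.02) -> bool:
    """Hypothetical stand-in for the cheap first-stage detector."""
    return rms(frame) >= threshold


def strict_check(frame: Frame, threshold: float = 0.05) -> bool:
    """Hypothetical stand-in for the slower second-stage verifier."""
    return rms(frame) >= threshold


def detect_speech(
    frames: Iterable[Frame],
    first: Callable[[Frame], bool] = fast_gate,
    second: Callable[[Frame], bool] = strict_check,
) -> List[bool]:
    """Flag a frame as speech only when both stages agree.

    `and` short-circuits, so the second stage runs only on frames
    that the first stage has already accepted.
    """
    return [first(frame) and second(frame) for frame in frames]
```

Running the expensive check only on frames that pass the cheap one is what keeps a cascade like this affordable in a real-time loop.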

Quick Start & Requirements

Installation: pip install RealtimeSTT
GPU Support (Recommended): Requires NVIDIA CUDA Toolkit 11.8, cuDNN 8.7.0 for CUDA 11.x, and PyTorch with CUDA support (pip install torch==2.0.1+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118).
Other Dependencies: ffmpeg (installable via package managers or direct download).
WebUI: Run python webui.py.
Server: Run python RealtimeSTT_server2.py and access via index.html.
Documentation: README
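Before launching the web UI or server, it may help to confirm that the dependencies listed above are present. The following is a minimal sketch, not part of the project: the `preflight` helper and its report keys are hypothetical names, and it checks only whether ffmpeg is on the PATH and whether an installed PyTorch reports CUDA as available.

```python
import importlib.util
import shutil
from typing import Dict


def preflight() -> Dict[str, bool]:
    """Report which of the listed dependencies appear to be available."""
    report = {
        # ffmpeg is expected to be reachable as an executable on PATH.
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # find_spec returns None when the package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported lazily; only reached when installed

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken install is treated the same as "no CUDA".
            pass
    return report
```

A caller could print the report and fall back to a smaller model when `cuda_available` is false.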

Highlighted Details

Supports real-time transcription with configurable models (tiny to large-v2).
Features wake word activation (e.g., "jarvis") for triggering recordings.
Integrates with OpenAI and ZhipuAI (streaming LLM) and Edge-TTS.
Offers a web UI for configuration and cross-network service calls.
Includes callbacks for various events like recording start/stop and transcription updates.
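The model choice, wake word, and callbacks listed above might be wired together roughly as follows. This is a sketch under assumptions: the keyword names (`wake_words`, `on_recording_start`, `on_recording_stop`, `enable_realtime_transcription`, `on_realtime_transcription_update`) follow the upstream RealtimeSTT `AudioToTextRecorder` and should be checked against the installed version; `recorder_options` and `transcribe_once` are hypothetical helpers; and the intermediate model names are the usual Whisper sizes rather than a list taken from this project.

```python
from typing import Any, Callable, Dict, Optional

# "tiny" and "large-v2" come from the text; the sizes in between are
# the standard Whisper model names (an assumption for this project).
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v1", "large-v2")


def recorder_options(
    model: str = "tiny",
    wake_word: Optional[str] = None,
    on_start: Optional[Callable[[], None]] = None,
    on_stop: Optional[Callable[[], None]] = None,
    on_update: Optional[Callable[[str], None]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the recorder, validating the model name."""
    if model not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model!r}")
    options: Dict[str, Any] = {"model": model}
    if wake_word:
        options["wake_words"] = wake_word
    if on_start:
        options["on_recording_start"] = on_start
    if on_stop:
        options["on_recording_stop"] = on_stop
    if on_update:
        # Live updates need real-time transcription switched on.
        options["enable_realtime_transcription"] = True
        options["on_realtime_transcription_update"] = on_update
    return options


def transcribe_once(options: Dict[str, Any]) -> str:
    """Record one utterance and return its transcription."""
    # Imported lazily so recorder_options works without the package.
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(**options)
    return recorder.text()
```

For example, `transcribe_once(recorder_options(wake_word="jarvis", on_update=print))` would be expected to wait for the wake word and print partial text as it arrives, assuming a microphone and the package are available.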

Maintenance & Community

Recent updates include bug fixes for the web UI, custom OpenAI model configuration, and wake word activation.
The project is open for contributions.

Licensing & Compatibility

License: MIT.
Compatibility: The MIT license permits commercial use and integration into closed-source projects.

Limitations & Caveats

GPU acceleration is strongly recommended for optimal performance, especially with real-time transcription.
Some demo scripts require API keys to be set as environment variables (e.g., OPENAI_API_KEY).
The provided web UI is noted as "not complete, but usable."
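Because some demo scripts expect API keys in the environment, a small guard can fail early with a clear message instead of partway through a conversation. This is a minimal sketch: `missing_keys` is a hypothetical helper, and `OPENAI_API_KEY` is the only variable name taken from the text above.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_keys(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

A demo script could call `missing_keys(["OPENAI_API_KEY"])` at startup and exit with the list of missing names if it is not empty.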

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

1 star in the last 30 days

Explore Similar Projects

Starred by Michael Han (Cofounder of Unsloth).

FluidVoice by altic-dev

macOS app for local voice-to-text transcription with AI enhancement

Created 3 months ago

Updated 1 day ago

wingmanAI by e-johnstonn

Real-time transcription tool with ChatGPT integration

Created 2 years ago

Updated 2 years ago

LiveWhisper by Nikorasu

Live transcription tool using OpenAI's Whisper

Created 3 years ago

Updated 5 months ago

AIVoiceChat by KoljaB

Voice chat for low-latency AI companion interaction

Created 2 years ago

Updated 6 months ago

Starred by Georgi Gerganov (Author of llama.cpp, whisper.cpp).

transcriber_app by davabase

Real-time speech-to-text transcription app

Created 3 years ago

Updated 3 years ago

fast-voice-assistant by dsa

AI voice assistant demo with <500ms response

Created 1 year ago

Updated 1 year ago

speech-to-text by reriiasu

Real-time transcription tool using faster-whisper

Created 2 years ago

Updated 1 year ago

Starred by Emile Vauge (Founder of Traefik).

Scriberr by rishikanthc

Self-hosted app for local AI audio transcription

Created 1 year ago

Updated 4 days ago

Starred by Jeremy Howard (Cofounder of fast.ai).

whisper-writer by savbell

Dictation app using OpenAI's Whisper model for real-time transcription

Created 2 years ago

Updated 1 year ago

QuickAgent by gkamradt

Voice bot demo using speech and language models

Created 1 year ago

Updated 1 year ago

Starred by Amin Ahmad (Cofounder of Vectara) and Jeremy Howard (Cofounder of fast.ai).

whisper_streaming by ufal

Real-time streaming for long speech-to-text transcription/translation

Created 2 years ago

Updated 2 months ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Travis Fischer (Founder of Agentic).

RealtimeSTT by KoljaB

Speech-to-text library for realtime applications

Created 2 years ago

Updated 6 months ago
