jt-live-whisper by jasoncheng7115

On-device AI voice toolkit for real-time transcription, translation, and meeting summarization

Created 1 month ago
331 stars

Top 83.0% on SourcePulse

View on GitHub
Project Summary

This toolkit addresses the need for fully on-device AI voice processing, offering real-time transcription, translation, speaker diarization, and meeting summarization without relying on cloud services. It targets users prioritizing data privacy, security, and cost savings, enabling AI-powered audio analysis for sensitive meetings or any application's audio output.

How It Works

The project employs a modular architecture, integrating various open-source AI models for speech recognition (Whisper variants, Moonshine), translation (NLLB, local LLMs via Ollama), and summarization (local LLMs). Audio is captured at the system level using virtual audio drivers (BlackHole on macOS, WASAPI Loopback on Windows), allowing processing of any application's audio stream. All AI inference occurs locally or on a private GPU server, ensuring data never leaves the user's control.
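The modular, swap-in-any-engine design described above can be sketched as a pair of engine interfaces feeding a pipeline. The class and method names below are illustrative assumptions, not jt-live-whisper's actual API; the stand-in engines exist only so the sketch runs without downloading any models.

```python
# Illustrative sketch of a modular ASR -> translation pipeline
# (hypothetical names; not the project's actual code).
from abc import ABC, abstractmethod


class ASREngine(ABC):
    """Speech-to-text backend (e.g., a Whisper variant or Moonshine)."""
    @abstractmethod
    def transcribe(self, audio_chunk: bytes) -> str: ...


class TranslationEngine(ABC):
    """Text translation backend (e.g., NLLB, Argos, or a local LLM)."""
    @abstractmethod
    def translate(self, text: str, target_lang: str) -> str: ...


class EchoASR(ASREngine):
    """Stand-in engine so the sketch runs without model downloads."""
    def transcribe(self, audio_chunk: bytes) -> str:
        return f"<{len(audio_chunk)} bytes transcribed>"


class UpperTranslator(TranslationEngine):
    """Stand-in 'translator' that just uppercases and tags the text."""
    def translate(self, text: str, target_lang: str) -> str:
        return f"[{target_lang}] {text.upper()}"


class Pipeline:
    """Chains captured audio through ASR and translation; engines are swappable."""
    def __init__(self, asr: ASREngine, translator: TranslationEngine):
        self.asr = asr
        self.translator = translator

    def process(self, audio_chunk: bytes, target_lang: str) -> str:
        return self.translator.translate(self.asr.transcribe(audio_chunk), target_lang)


pipeline = Pipeline(EchoASR(), UpperTranslator())
print(pipeline.process(b"\x00" * 1024, "zh"))  # prints "[zh] <1024 BYTES TRANSCRIBED>"
```

In the real toolkit, the audio chunks would come from the system-level capture layer (BlackHole or WASAPI Loopback), and each interface would be backed by one of the engines listed above.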

Quick Start & Requirements

  • Installation: One-click installation scripts (install.sh for macOS, install.ps1 for Windows) automate the download and setup of AI models and dependencies.
  • OS: macOS (Apple Silicon/Intel), Windows 10+.
  • Python: 3.12+.
  • Dependencies: Homebrew (macOS), PowerShell 5.1+ (Windows), BlackHole 2ch (macOS, auto-installed). A local LLM server (e.g., Ollama) is required for summarization and gives the best translation quality; basic transcription and offline translation via NLLB/Argos work without one.
  • Estimated Setup Time: 10-20 minutes for initial model downloads.
  • Links: macOS install script: https://raw.githubusercontent.com/jasoncheng7115/jt-live-whisper/main/install.sh; Windows install script: https://raw.githubusercontent.com/jasoncheng7115/jt-live-whisper/main/install.ps1.
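Before running either script, it may help to confirm the Python 3.12+ requirement. A minimal preflight check might look like the following; this is a sketch, not part of the project's installers.

```shell
# Minimal preflight check for the Python 3.12+ requirement
# (illustrative only; the real install scripts do their own setup).
check_python() {
  major=${1%%.*}
  minor=${1#*.}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 12 ]; }
}

ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo "0.0")
if check_python "$ver"; then
  echo "Python $ver meets the 3.12+ requirement"
else
  echo "Python $ver found; 3.12 or newer is required"
fi
```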

Highlighted Details

  • Real-time system audio transcription and translation (e.g., English to Chinese, Japanese to Chinese).
  • Batch processing of audio files (mp3, wav, m4a, flac) with faster-whisper.
  • Speaker diarization using resemblyzer and spectralcluster.
  • AI meeting summarization and generation of timestamped, speaker-attributed transcripts via local LLMs.
  • Support for multiple ASR engines (Whisper, Moonshine) and translation engines (LLM, NLLB, Argos).
  • A WebUI accessible via browser (including mobile/tablet) for configuration and operation.
  • Optional microphone transcription and dual-language subtitle modes.
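As a rough illustration of the batch-processing entry point, the sketch below collects files in the supported formats (mp3, wav, m4a, flac) from a directory. The actual transcription call via faster-whisper is only indicated in a comment, since it requires a model download; the helper name is an assumption, not the project's real function.

```python
# Illustrative batch-file discovery for the audio formats the README
# lists (mp3, wav, m4a, flac). The transcription step is only sketched
# in a comment below.
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def collect_audio_files(root):
    """Return supported audio files under root, sorted for stable ordering."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.mp3", "b.FLAC", "notes.txt"):
        (Path(d) / name).touch()
    print([p.name for p in collect_audio_files(d)])  # prints "['a.mp3', 'b.FLAC']"
    # Each file would then be fed to faster-whisper, e.g.:
    #   from faster_whisper import WhisperModel
    #   model = WhisperModel("small")
    #   segments, info = model.transcribe(str(path))
```

Note that extension matching is case-insensitive, so files like `b.FLAC` are picked up alongside lowercase names.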

Maintenance & Community

The project is maintained by Jason Cheng (Jason Tools). No specific community channels (like Discord/Slack) or major contributor/sponsorship information is detailed in the README.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Speaker diarization accuracy may vary with audio quality and speaker voice similarity.
  • Translation quality depends on the chosen engine; local LLMs offer the best results but require dedicated setup.
  • Summarization requires a local LLM server; the offline engines support only transcription and translation.
  • Performance is heavily influenced by local hardware capabilities (CPU/GPU).
  • macOS users must configure specific audio routing via "Audio MIDI Setup."

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 330 stars in the last 30 days
