jt-live-whisper by jasoncheng7115

On-device AI voice toolkit for real-time transcription, translation, and meeting summarization

Created 1 month ago
331 stars

Top 83.0% on SourcePulse

View on GitHub
Project Summary

This toolkit addresses the need for fully on-device AI voice processing, offering real-time transcription, translation, speaker diarization, and meeting summarization without relying on cloud services. It targets users prioritizing data privacy, security, and cost savings, enabling AI-powered audio analysis for sensitive meetings or any application's audio output.

How It Works

The project employs a modular architecture, integrating various open-source AI models for speech recognition (Whisper variants, Moonshine), translation (NLLB, local LLMs via Ollama), and summarization (local LLMs). Audio is captured at the system level using virtual audio drivers (BlackHole on macOS, WASAPI Loopback on Windows), allowing processing of any application's audio stream. All AI inference occurs locally or on a private GPU server, ensuring data never leaves the user's control.
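The modular, swap-in-any-engine design described above can be sketched as a pair of engine interfaces feeding a pipeline. The class and method names below are illustrative assumptions, not jt-live-whisper's actual API; the stand-in engines exist only so the sketch runs without downloading any models.

```python
# Illustrative sketch of a modular ASR -> translation pipeline
# (hypothetical names; not the project's actual code).
from abc import ABC, abstractmethod


class ASREngine(ABC):
    """Speech-to-text backend (e.g., a Whisper variant or Moonshine)."""
    @abstractmethod
    def transcribe(self, audio_chunk: bytes) -> str: ...


class TranslationEngine(ABC):
    """Text translation backend (e.g., NLLB, Argos, or a local LLM)."""
    @abstractmethod
    def translate(self, text: str, target_lang: str) -> str: ...


class EchoASR(ASREngine):
    """Stand-in engine so the sketch runs without model downloads."""
    def transcribe(self, audio_chunk: bytes) -> str:
        return f"<{len(audio_chunk)} bytes transcribed>"


class UpperTranslator(TranslationEngine):
    """Stand-in 'translator' that just uppercases and tags the text."""
    def translate(self, text: str, target_lang: str) -> str:
        return f"[{target_lang}] {text.upper()}"


class Pipeline:
    """Chains captured audio through ASR and translation; engines are swappable."""
    def __init__(self, asr: ASREngine, translator: TranslationEngine):
        self.asr = asr
        self.translator = translator

    def process(self, audio_chunk: bytes, target_lang: str) -> str:
        return self.translator.translate(self.asr.transcribe(audio_chunk), target_lang)


pipeline = Pipeline(EchoASR(), UpperTranslator())
print(pipeline.process(b"\x00" * 1024, "zh"))  # prints "[zh] <1024 BYTES TRANSCRIBED>"
```

In the real toolkit, the audio chunks would come from the system-level capture layer (BlackHole or WASAPI Loopback), and each interface would be backed by one of the engines listed above.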

Quick Start & Requirements

  • Installation: One-click installation scripts (install.sh for macOS, install.ps1 for Windows) automate the download and setup of AI models and dependencies.
  • OS: macOS (Apple Silicon/Intel), Windows 10+.
  • Python: 3.12+.
  • Dependencies: Homebrew (macOS), PowerShell 5.1+ (Windows), BlackHole 2ch (macOS, auto-installed). A local LLM server (e.g., Ollama) is required for summarization and gives the best translation quality; basic transcription and offline translation via NLLB/Argos work without one.
  • Estimated Setup Time: 10-20 minutes for initial model downloads.
  • Links: macOS install script: https://raw.githubusercontent.com/jasoncheng7115/jt-live-whisper/main/install.sh; Windows install script: https://raw.githubusercontent.com/jasoncheng7115/jt-live-whisper/main/install.ps1.
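Before running either script, it may help to confirm the Python 3.12+ requirement. A minimal preflight check might look like the following; this is a sketch, not part of the project's installers.

```shell
# Minimal preflight check for the Python 3.12+ requirement
# (illustrative only; the real install scripts do their own setup).
check_python() {
  major=${1%%.*}
  minor=${1#*.}
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 12 ]; }
}

ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || echo "0.0")
if check_python "$ver"; then
  echo "Python $ver meets the 3.12+ requirement"
else
  echo "Python $ver found; 3.12 or newer is required"
fi
```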

Highlighted Details

  • Real-time system audio transcription and translation (e.g., English to Chinese, Japanese to Chinese).
  • Batch processing of audio files (mp3, wav, m4a, flac) with faster-whisper.
  • Speaker diarization using resemblyzer and spectralcluster.
  • AI meeting summarization and generation of timestamped, speaker-attributed transcripts via local LLMs.
  • Support for multiple ASR engines (Whisper, Moonshine) and translation engines (LLM, NLLB, Argos).
  • A WebUI accessible via browser (including mobile/tablet) for configuration and operation.
  • Optional microphone transcription and dual-language subtitle modes.
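As a rough illustration of the batch-processing entry point, the sketch below collects files in the supported formats (mp3, wav, m4a, flac) from a directory. The actual transcription call via faster-whisper is only indicated in a comment, since it requires a model download; the helper name is an assumption, not the project's real function.

```python
# Illustrative batch-file discovery for the audio formats the README
# lists (mp3, wav, m4a, flac). The transcription step is only sketched
# in a comment below.
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def collect_audio_files(root):
    """Return supported audio files under root, sorted for stable ordering."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("a.mp3", "b.FLAC", "notes.txt"):
        (Path(d) / name).touch()
    print([p.name for p in collect_audio_files(d)])  # prints "['a.mp3', 'b.FLAC']"
    # Each file would then be fed to faster-whisper, e.g.:
    #   from faster_whisper import WhisperModel
    #   model = WhisperModel("small")
    #   segments, info = model.transcribe(str(path))
```

Note that extension matching is case-insensitive, so files like `b.FLAC` are picked up alongside lowercase names.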

Maintenance & Community

The project is maintained by Jason Cheng (Jason Tools). No specific community channels (like Discord/Slack) or major contributor/sponsorship information is detailed in the README.

Licensing & Compatibility

The project is licensed under the Apache License 2.0. This permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Speaker diarization accuracy may vary with audio quality and speaker voice similarity.
  • Translation quality depends on the chosen engine; local LLMs offer the best results but require dedicated setup.
  • Summarization requires a local LLM server; the offline engines support only transcription and translation.
  • Performance is heavily influenced by local hardware capabilities (CPU/GPU).
  • macOS users must configure specific audio routing via "Audio MIDI Setup."

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 330 stars in the last 30 days
