auto-caption by HiMeditator

Real-time subtitle display for cross-platform use

Created 9 months ago

484 stars

Top 63.5% on SourcePulse

Project Summary

Auto Caption is a cross-platform application that generates subtitles from audio input and displays them in real time. It is aimed at users who need live transcription, and it integrates with several speech-to-text and translation engines to improve accessibility and communication.

How It Works

The software captures system or microphone audio, processing it through selectable speech-to-text (STT) engines: cloud-based Gummy (Alibaba), local Vosk, or local SOSV (Sherpa-ONNX SenseVoice). It supports optional translation via local Ollama LLMs or Google Translate API. Key architectural choices include modular engine support, extensive subtitle styling, and cross-platform compatibility (Windows, macOS, Linux) with multi-language UI.
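The modular engine idea described above can be illustrated with a minimal sketch. Everything here is hypothetical and is not taken from the Auto Caption codebase: the `SttEngine` interface, the `ENGINES` registry, and the `EchoEngine` stand-in (which simply decodes bytes as text) are assumptions chosen to show how an engine can be selected by name and how translation can remain optional.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


class SttEngine(Protocol):
    """Hypothetical interface: turns a chunk of raw audio into text."""

    def transcribe(self, audio_chunk: bytes) -> str: ...


@dataclass
class Caption:
    text: str
    translation: Optional[str] = None  # stays None when no translator is set


class EchoEngine:
    """Stand-in engine for illustration only; decodes the bytes as UTF-8."""

    def transcribe(self, audio_chunk: bytes) -> str:
        return audio_chunk.decode("utf-8")


# Hypothetical registry: engine name -> factory. A real application would
# register wrappers around Gummy, Vosk, or SOSV here instead.
ENGINES: dict[str, Callable[[], SttEngine]] = {"echo": EchoEngine}


def make_caption(
    engine_name: str,
    audio_chunk: bytes,
    translate: Optional[Callable[[str], str]] = None,
) -> Caption:
    """Run the selected engine, then the optional translation step."""
    engine = ENGINES[engine_name]()
    text = engine.transcribe(audio_chunk)
    return Caption(text, translate(text) if translate else None)
```

Keeping the translator as an optional callable means the same pipeline works whether translation is disabled, backed by a local model, or backed by a cloud API.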

Quick Start & Requirements

Installation: Download from GitHub Releases.
Prerequisites: Python >= 3.10 (3.12 recommended) for building the subtitle engine, and Node.js for development. Local STT and translation models (Vosk, SOSV, Ollama) must be downloaded and configured separately. Gummy requires an Alibaba Cloud API key. On macOS and Linux, system audio capture needs additional configuration (see the Auto Caption User Manual).
Resources: Setup time and resource footprint vary significantly based on chosen local models.
Documentation: Auto Caption User Manual, Subtitle Engine Documentation, Changelog.

Highlighted Details

Supports over 30 languages via Vosk, 10 via Gummy, and 5 via SOSV.
Offers real-time translation capabilities using Ollama or Google Translate API.
Features extensive subtitle styling options (font, size, color, background).
Allows subtitle recording and export in .srt and .json formats.
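As an illustration of the .srt and .json export mentioned above, here is a minimal sketch using only the Python standard library. The `CaptionEntry` record and its field names are assumptions made for this example; the actual fields Auto Caption writes are not described in this summary. The SRT layout itself (sequence number, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard SubRip structure.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CaptionEntry:
    # Hypothetical record layout, chosen for illustration.
    start: float  # seconds from the start of the recording
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(entries: list[CaptionEntry]) -> str:
    """Render entries as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        timing = f"{srt_timestamp(entry.start)} --> {srt_timestamp(entry.end)}"
        blocks.append(f"{index}\n{timing}\n{entry.text}\n")
    return "\n".join(blocks)


def to_json(entries: list[CaptionEntry]) -> str:
    """Serialize entries as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps subtitles in languages such as Chinese or Japanese human-readable in the exported file.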

Maintenance & Community

The project has released v1.0.0 and plans further engine development. No specific community channels, contributor details, or sponsorship information were provided in the README.

Licensing & Compatibility

License type and compatibility details are not specified in the provided README content.

Limitations & Caveats

System audio capture on macOS and Linux requires extra configuration. Vosk's recognition quality is noted as poor, and its output lacks punctuation. Gummy may be unavailable or restricted outside China. Ollama translation performance depends heavily on model size; smaller models (<1B parameters) are recommended to limit latency and resource consumption. Google Translate API availability is region-dependent.
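To make the Ollama caveat concrete, the sketch below shows one way to request a translation from a locally running Ollama server through its `/api/generate` HTTP endpoint. This is not Auto Caption's own implementation: the prompt wording, the helper names, and the timeout are assumptions, and the model name is left as a parameter because the summary does not name one. A short timeout is one way to keep a slow model from stalling live subtitles.

```python
import json
import urllib.request

# Default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, text: str, target_lang: str) -> dict:
    """Build a non-streaming generate request (prompt wording is illustrative)."""
    prompt = (
        f"Translate the following text into {target_lang}. "
        f"Reply with the translation only.\n\n{text}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def translate(model: str, text: str, target_lang: str, timeout: float = 10.0) -> str:
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_request(model, text, target_lang)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()
```

Calling `translate(...)` requires a running Ollama server with the chosen model already pulled; `build_request` can be inspected on its own without any server.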

Health Check

Last Commit

2 weeks ago

Responsiveness

Inactive

Star History

32 stars in the last 30 days