stt by jianchang512

Offline speech-to-text tool for local audio/video transcription

created 1 year ago
3,667 stars

Top 13.5% on sourcepulse

Project Summary

This tool provides an offline, local voice recognition service that converts audio/video into text, with support for JSON, SRT, and plain text output formats. It's designed for users needing to self-host a speech-to-text solution, offering accuracy comparable to OpenAI's API, and is particularly useful for developers and researchers working with audio data.

How It Works

The project leverages the faster-whisper open-source model, known for its efficiency and accuracy. It supports various model sizes (tiny to large-v3), allowing users to balance performance with computational resource requirements. The tool operates as a local web service, accessible via a browser interface or an API, and automatically utilizes NVIDIA GPU acceleration via CUDA if configured.
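
For orientation, here is a minimal sketch of what transcription with the underlying faster-whisper library looks like. This illustrates the library the project builds on rather than the project's own code; the model size, file name, and device settings are placeholder assumptions.

    from faster_whisper import WhisperModel

    # Choose a model size to balance accuracy against VRAM/CPU cost ("tiny" ... "large-v3").
    model = WhisperModel("small", device="cuda", compute_type="float16")  # use device="cpu" without a GPU

    # Transcribe a local audio file; segments are yielded lazily with timestamps.
    segments, info = model.transcribe("example.mp3", language="zh")
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")

The project wraps this kind of call in a Flask web service and converts the timed segments into the JSON, SRT, or plain-text outputs described above.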

Quick Start & Requirements

  • Installation: Download pre-compiled Windows binaries from Releases or clone the repository and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.9-3.11 and FFmpeg (Windows users need to extract ffmpeg.exe and ffprobe.exe into the project directory). For GPU acceleration, an NVIDIA GPU with the CUDA 11.x/12.x toolkit and cuDNN is required.
  • Models: Download model archives and place them in the models directory.
  • Running: Execute python start.py to launch the local web UI (a programmatic API call is sketched after this list).
  • Docs: English README
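
Once start.py is running, the service can also be called programmatically rather than through the browser. The sketch below uses Python's requests library; the port, endpoint path, and form-field names are assumptions drawn from the project's documentation and should be verified against the current README.

    import requests

    # Hypothetical call to the locally running stt service; confirm the port,
    # path, and field names against the project's README before relying on them.
    with open("example.mp3", "rb") as audio:
        resp = requests.post(
            "http://127.0.0.1:9977/api",
            data={"language": "zh", "model": "small", "response_format": "srt"},
            files={"file": audio},
        )
    print(resp.json())

Changing response_format here selects among the JSON, SRT, and plain-text outputs mentioned above.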

Highlighted Details

  • Offers an API endpoint for programmatic access.
  • Supports multiple languages for transcription.
  • Automatic detection and utilization of CUDA for NVIDIA GPUs.
  • Provides options for different output formats (JSON, SRT, text).

Maintenance & Community

  • Active development, with a Discord community reachable via an invite link.
  • Project acknowledges dependencies on faster-whisper, Flask, and FFmpeg.

Licensing & Compatibility

  • The README does not explicitly state a license, and the project depends on other open-source components with their own licenses. Compatibility for commercial use or closed-source linking should be verified.

Limitations & Caveats

  • Chinese language output may sometimes be in Traditional Chinese.
  • GPU acceleration requires careful setup of CUDA and cuDNN; incorrect configuration can lead to errors or crashes.
  • Large models (large-v3) demand significant GPU VRAM (8GB+ recommended), and insufficient VRAM can cause crashes, especially with larger audio/video files.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

334 stars in the last 90 days
