stt by jianchang512

Offline speech-to-text tool for local audio/video transcription

created 1 year ago
3,667 stars

Top 13.5% on sourcepulse

Project Summary

This tool provides an offline, local voice recognition service that converts audio/video into text, with support for JSON, SRT, and plain text output formats. It's designed for users needing to self-host a speech-to-text solution, offering accuracy comparable to OpenAI's API, and is particularly useful for developers and researchers working with audio data.

How It Works

The project leverages the faster-whisper open-source model, known for its efficiency and accuracy. It supports various model sizes (tiny to large-v3), allowing users to balance performance with computational resource requirements. The tool operates as a local web service, accessible via a browser interface or an API, and automatically utilizes NVIDIA GPU acceleration via CUDA if configured.
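
For orientation, here is a minimal sketch of what transcription with the underlying faster-whisper library looks like. This illustrates the library the project builds on rather than the project's own code; the model size, file name, and device settings are placeholder assumptions.

    from faster_whisper import WhisperModel

    # Choose a model size to balance accuracy against VRAM/CPU cost ("tiny" ... "large-v3").
    model = WhisperModel("small", device="cuda", compute_type="float16")  # use device="cpu" without a GPU

    # Transcribe a local audio file; segments are yielded lazily with timestamps.
    segments, info = model.transcribe("example.mp3", language="zh")
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")

The project wraps this kind of call in a Flask web service and converts the timed segments into the JSON, SRT, or plain-text outputs described above.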

Quick Start & Requirements

  • Installation: Download pre-compiled Windows binaries from Releases or clone the repository and install dependencies (pip install -r requirements.txt).
  • Prerequisites: Python 3.9-3.11 and FFmpeg (Windows users need to extract ffmpeg.exe and ffprobe.exe into the project directory). For GPU acceleration, an NVIDIA GPU with the CUDA 11.x/12.x toolkit and cuDNN is required.
  • Models: Download model archives and place them in the models directory.
  • Running: Execute python start.py to launch the local web UI (a programmatic API call is sketched after this list).
  • Docs: English README
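
Once start.py is running, the service can also be called programmatically rather than through the browser. The sketch below uses Python's requests library; the port, endpoint path, and form-field names are assumptions drawn from the project's documentation and should be verified against the current README.

    import requests

    # Hypothetical call to the locally running stt service; confirm the port,
    # path, and field names against the project's README before relying on them.
    with open("example.mp3", "rb") as audio:
        resp = requests.post(
            "http://127.0.0.1:9977/api",
            data={"language": "zh", "model": "small", "response_format": "srt"},
            files={"file": audio},
        )
    print(resp.json())

Changing response_format here selects among the JSON, SRT, and plain-text outputs mentioned above.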

Highlighted Details

  • Offers an API endpoint for programmatic access.
  • Supports multiple languages for transcription.
  • Automatic detection and utilization of CUDA for NVIDIA GPUs.
  • Provides options for different output formats (JSON, SRT, text).

Maintenance & Community

  • Active development, with a Discord community reachable via an invite link.
  • Project acknowledges dependencies on faster-whisper, Flask, and FFmpeg.

Licensing & Compatibility

  • The README does not explicitly state a license, and the project depends on other open-source components with their own licenses. Compatibility for commercial use or closed-source linking should be verified.

Limitations & Caveats

  • Chinese language output may sometimes be in Traditional Chinese.
  • GPU acceleration requires careful setup of CUDA and cuDNN; incorrect configuration can lead to errors or crashes.
  • Large models (large-v3) demand significant GPU VRAM (8GB+ recommended), and insufficient VRAM can cause crashes, especially with larger audio/video files.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

334 stars in the last 90 days
