transcribe-anything by zackees

CLI tool for Whisper AI transcription and translation

Created 4 years ago
1,105 stars

Top 34.5% on SourcePulse

Project Summary

This project provides a multi-backend command-line interface for Whisper AI transcription, designed for ease of use and speed. It targets users who need to transcribe audio or video from local files or URLs, and it supports speaker diarization and GPU acceleration. The primary benefit is a simple, private transcription workflow with backends tuned for performance.

How It Works

The application supports several Whisper backends: OpenAI's original model (cuda), the highly optimized insanely-fast-whisper (insane), and Apple's whisper-mps (mps) for acceleration on ARM Macs. It uses yt-dlp to fetch media from URLs and static-ffmpeg for media processing. A key differentiator is the speaker.json output file, which segments a conversation by speaker; diarization relies on pyannote.audio through Hugging Face integration. Environment isolation via uv keeps dependencies managed and speeds up installs.

Quick Start & Requirements

  • Install via pip: pip install transcribe-anything
  • Usage: transcribe-anything <URL_or_FILE> [--device <cuda|insane|mps|cpu>]
  • GPU acceleration (cuda, insane) is automatic on Windows/Linux. Mac users can use --device mps.
  • Speaker diarization requires a Hugging Face token, and users must first agree to the pyannote.audio policies.
  • Python 3.10+ is recommended.
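
The usage line above can be wrapped in a small script. The following Python sketch assembles the command from the pieces named in this summary: the positional URL or file, --device, and the --embed flag listed under Highlighted Details. The helper names build_command and run are hypothetical and are not part of transcribe-anything, and the sketch assumes the tool is already installed and on PATH before run is called.

```python
import shlex
import subprocess
from typing import List, Optional

# Device names as they appear in the usage line of this summary.
KNOWN_DEVICES = ("cuda", "insane", "mps", "cpu")


def build_command(source: str, device: Optional[str] = None,
                  embed: bool = False) -> List[str]:
    """Build an argv list for the transcribe-anything CLI (hypothetical helper)."""
    if not source:
        raise ValueError("source must be a non-empty URL or file path")
    if device is not None and device not in KNOWN_DEVICES:
        raise ValueError(f"unknown device {device!r}; expected one of {KNOWN_DEVICES}")
    cmd = ["transcribe-anything", source]
    if device is not None:
        # Omitting --device leaves backend selection to the tool itself.
        cmd += ["--device", device]
    if embed:
        cmd.append("--embed")
    return cmd


def run(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires the tool to be installed)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without the tool.
    print(shlex.join(build_command("interview.mp4", device="insane")))
```

Passing an argv list to subprocess.run avoids shell quoting problems when the source is a URL containing characters such as & or ?.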

Highlighted Details

  • Offers multiple backends for optimized speed (insane, mps).
  • Unique feature: Generates speaker.json for speaker-attributed transcriptions.
  • Supports transcription and translation tasks.
  • Can embed subtitles directly into video files (--embed).

Maintenance & Community

The project is actively maintained by zackees. Recent updates focus on improving backend compatibility, fixing dependency issues (e.g., with NumPy 2.0), and enhancing features such as MPS support and speaker.json generation.

Licensing & Compatibility

The project appears to be MIT licensed, allowing for commercial use and integration with closed-source projects.

Limitations & Caveats

The insane backend, while fast, can be memory-intensive and may lead to out-of-memory errors on GPUs with less VRAM. The mps backend is English-only and does not support speaker.json. Python 3.12 is not yet fully supported in the backend. Experimental features like insane mode with large-v3 and batching may produce lower-quality transcriptions with timestamp misalignment.
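
The caveats above can be checked before a long job is started. The following Python sketch encodes them as a pre-flight check that returns human-readable warnings. The function name preflight_warnings, its parameters, and the use of "en" as the English language code are assumptions made for illustration; none of this is part of transcribe-anything, and the checks only restate the limitations listed in this summary.

```python
import sys
from typing import List, Optional, Tuple


def preflight_warnings(
    device: Optional[str],
    language: str = "en",              # assumed language code for English
    want_speaker_json: bool = False,
    python_version: Tuple[int, int] = sys.version_info[:2],
) -> List[str]:
    """Return warnings for option combinations that the caveats above rule out."""
    warnings: List[str] = []
    if device == "mps":
        if language != "en":
            warnings.append("mps backend is English-only")
        if want_speaker_json:
            warnings.append("mps backend does not support speaker.json")
    if device == "insane":
        warnings.append("insane backend may run out of memory on GPUs with less VRAM")
    if python_version >= (3, 12):
        warnings.append("Python 3.12 is not yet fully supported")
    return warnings


if __name__ == "__main__":
    for message in preflight_warnings("mps", language="fr", want_speaker_json=True):
        print("warning:", message)
```

Returning a list instead of raising lets a caller decide whether a given caveat should block the run or only be logged.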

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

39 stars in the last 30 days
