ASR tool for word-level timestamps and confidence scores using Whisper
Top 18.9% on sourcepulse
This project provides word-level timestamps and confidence scores for multilingual Automatic Speech Recognition (ASR) using OpenAI's Whisper models. It addresses the limitation of Whisper's segment-level timestamps, offering a more granular and accurate transcription for researchers and developers working with speech data.
How It Works
The core innovation lies in using Dynamic Time Warping (DTW) on Whisper's cross-attention weights to derive word-level alignments. This approach avoids the need for language-specific models or character normalization required by other methods, and it performs alignment on-the-fly without additional inference steps, optimizing memory usage for long audio files.
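To make the idea concrete, here is a minimal, self-contained DTW sketch in the spirit of that alignment (not the project's actual code): it treats a toy matrix of hypothetical cross-attention weights as a token-vs-frame cost surface and recovers a monotonic alignment path; mapping token indices back to words and frame indices back to times is what yields word-level timestamps.

```python
# Illustrative only: a toy DTW over a cost matrix derived from hypothetical
# cross-attention weights (tokens x audio frames). The real project aligns
# Whisper's actual attention tensors; all names and data here are made up.
import numpy as np

def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # Backtrack from the end to recover which frames each token is aligned to.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical attention weights: higher weight = stronger token/frame association.
rng = np.random.default_rng(0)
attention = rng.random((5, 20))            # 5 decoded tokens, 20 audio frames
cost = 1.0 - attention / attention.max()   # turn similarity into a cost
alignment = dtw_path(cost)                 # list of (token_index, frame_index) pairs
print(alignment)
```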
Quick Start & Requirements
Install with: pip3 install whisper-timestamped

Optional dependencies: matplotlib, torchaudio, onnxruntime (for VAD), and transformers (for Hugging Face models).
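A minimal usage sketch follows, assuming the package's Python API of load_audio, load_model, and transcribe, and a result dictionary whose segments carry per-word entries; "audio.wav" and the "tiny" model size are placeholders, and the exact result schema may vary between versions.

```python
import whisper_timestamped as whisper

# Load the audio and a small Whisper model ("audio.wav" and "tiny" are placeholders).
audio = whisper.load_audio("audio.wav")
model = whisper.load_model("tiny", device="cpu")

# Transcribe; the result is a dict with segment- and word-level details.
result = whisper.transcribe(model, audio, language="en")

# Print each word with its start/end timestamps and confidence score.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["text"]}\t{word["start"]:.2f}s-{word["end"]:.2f}s\tconf={word["confidence"]:.2f}')
```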
Highlighted Details
Maintenance & Community
Licensing & Compatibility

Built on openai-whisper (MIT) and dtw-python (GPL v3). The GPL v3 license of dtw-python may impose copyleft restrictions on derivative works.

Limitations & Caveats