Fine-tuned speech model for speaker diarization
This project provides a minimal extension to OpenAI's Whisper for speaker diarization, labeling who spoke when in transcripts. It's designed for researchers and developers working with conversational audio like meetings and podcasts, offering a lightweight and interpretable solution that integrates seamlessly with Whisper.
How It Works
The approach fine-tunes Whisper models to incorporate special tokens that denote speaker changes. This method leverages both voice and semantic context for improved speaker differentiation, a unique advantage over traditional diarization techniques. The minimal changes required (<50 lines) make it an efficient and cost-effective solution.
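To make the token-based approach concrete, the sketch below splits a transcript containing a speaker-change marker into alternating speaker segments. The token name `[SPEAKER_TURN]` and the two-speaker alternation are assumptions for illustration only; the actual token and its behavior are defined by the fine-tuned model.

```python
# Hypothetical post-processing sketch: split a transcript that contains a
# special speaker-turn token into per-speaker segments. The token name and
# the simple two-speaker alternation are illustrative assumptions.
SPEAKER_TURN = "[SPEAKER_TURN]"

def split_speakers(transcript: str) -> list[tuple[str, str]]:
    # Split on the turn token and alternate between two local speaker labels.
    segments = [s.strip() for s in transcript.split(SPEAKER_TURN)]
    return [(f"SPEAKER_{i % 2}", seg) for i, seg in enumerate(segments) if seg]

text = "Hi there. [SPEAKER_TURN] Hello! [SPEAKER_TURN] How are you?"
print(split_speakers(text))
# → [('SPEAKER_0', 'Hi there.'), ('SPEAKER_1', 'Hello!'), ('SPEAKER_0', 'How are you?')]
```

Note that this only marks local speaker changes; mapping turns to globally consistent speaker identities is a separate clustering step.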
Quick Start & Requirements
Install the package, then transcribe with the fine-tuned small.en-tdrz checkpoint:

```shell
pip install -e .
whisper --model small.en-tdrz AUDIO
```
Highlighted Details
Support for whisper.cpp enables running on consumer hardware.
Maintenance & Community
The project is described as a prototype/proof-of-concept, with plans for future development outlined in the roadmap. However, a recent update indicates plans have been paused.
Licensing & Compatibility
Code and model weights are released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Currently, only the small.en English model is fine-tuned. Timestamp behavior and deletion errors may differ from the original Whisper model. The project is still considered hacky and subject to change, with global diarization (speaker clustering) planned for a later stage.
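Since global diarization is planned but not yet implemented, one plausible design (not part of this project) is to embed each turn-delimited segment with a speaker encoder and cluster the embeddings so local turn labels map to consistent speaker IDs. The sketch below shows that idea with dummy 2-D embeddings and a minimal k-means; every name here is hypothetical.

```python
# Hypothetical sketch of a global diarization (speaker clustering) stage.
# Real systems would embed each segment with a speaker encoder; here we use
# dummy 2-D vectors so the clustering step itself is easy to follow.

def kmeans_2d(points, k=2, iters=10):
    # Deterministic init: use the first k points as centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Recompute centroids as the mean of their assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels

# Two well-separated dummy "speaker embeddings", interleaved as in dialogue.
segments = [(0.1, 0.0), (5.0, 5.1), (0.0, 0.2), (4.9, 5.0)]
print(kmeans_2d(segments))
# → [0, 1, 0, 1]  (segments 0 and 2 share one speaker ID, 1 and 3 the other)
```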