Fine-tuned speech model for speaker diarization
This project provides a minimal extension to OpenAI's Whisper for speaker diarization, labeling who spoke when in transcripts. It's designed for researchers and developers working with conversational audio like meetings and podcasts, offering a lightweight and interpretable solution that integrates seamlessly with Whisper.
How It Works
The approach fine-tunes Whisper models to incorporate special tokens that denote speaker changes. This method leverages both voice and semantic context for improved speaker differentiation, a unique advantage over traditional diarization techniques. The minimal changes required (<50 lines) make it an efficient and cost-effective solution.
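To make the token-based approach concrete, the sketch below splits a transcript containing a speaker-change marker into alternating speaker segments. The token name `[SPEAKER_TURN]` and the two-speaker alternation are assumptions for illustration only; the actual token and its behavior are defined by the fine-tuned model.

```python
# Hypothetical post-processing sketch: split a transcript that contains a
# special speaker-turn token into per-speaker segments. The token name and
# the simple two-speaker alternation are illustrative assumptions.
SPEAKER_TURN = "[SPEAKER_TURN]"

def split_speakers(transcript: str) -> list[tuple[str, str]]:
    # Split on the turn token and alternate between two local speaker labels.
    segments = [s.strip() for s in transcript.split(SPEAKER_TURN)]
    return [(f"SPEAKER_{i % 2}", seg) for i, seg in enumerate(segments) if seg]

text = "Hi there. [SPEAKER_TURN] Hello! [SPEAKER_TURN] How are you?"
print(split_speakers(text))
# → [('SPEAKER_0', 'Hi there.'), ('SPEAKER_1', 'Hello!'), ('SPEAKER_0', 'How are you?')]
```

Note that this only marks local speaker changes; mapping turns to globally consistent speaker identities is a separate clustering step.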
Quick Start & Requirements
Install the package, then transcribe with the fine-tuned small.en-tdrz checkpoint:

```shell
pip install -e .
whisper --model small.en-tdrz AUDIO
```
Highlighted Details
Support for whisper.cpp enables running on consumer hardware.
Maintenance & Community
The project is described as a prototype/proof-of-concept, with plans for future development outlined in the roadmap. However, a recent update indicates plans have been paused.
Licensing & Compatibility
Code and model weights are released under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Currently, only the small.en English model is fine-tuned. Timestamp behavior and deletion errors may differ from the original Whisper model. The project is still considered hacky and subject to change, with global diarization (speaker clustering) planned for a later stage.
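Since global diarization is planned but not yet implemented, one plausible design (not part of this project) is to embed each turn-delimited segment with a speaker encoder and cluster the embeddings so local turn labels map to consistent speaker IDs. The sketch below shows that idea with dummy 2-D embeddings and a minimal k-means; every name here is hypothetical.

```python
# Hypothetical sketch of a global diarization (speaker clustering) stage.
# Real systems would embed each segment with a speaker encoder; here we use
# dummy 2-D vectors so the clustering step itself is easy to follow.

def kmeans_2d(points, k=2, iters=10):
    # Deterministic init: use the first k points as centroids.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Recompute centroids as the mean of their assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels

# Two well-separated dummy "speaker embeddings", interleaved as in dialogue.
segments = [(0.1, 0.0), (5.0, 5.1), (0.0, 0.2), (4.9, 5.0)]
print(kmeans_2d(segments))
# → [0, 1, 0, 1]  (segments 0 and 2 share one speaker ID, 1 and 3 the other)
```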