ASR pipeline for speaker diarization
This repository provides a pipeline for automatic speech recognition (ASR) with speaker diarization, leveraging OpenAI's Whisper model. It's designed for researchers and developers working with audio data who need to identify who spoke which sentence in a transcription. The primary benefit is an integrated solution for accurate speaker attribution alongside speech-to-text.
How It Works
The pipeline first extracts vocals with Demucs to improve speaker-embedding accuracy. It then transcribes the audio with Whisper, and the resulting timestamps are corrected and aligned with ctc-forced-aligner to mitigate time-shift errors. Voice Activity Detection (VAD) and segmentation are performed with NeMo's MarbleNet to isolate speech segments and exclude silence. Speaker embeddings are extracted with NeMo's TitaNet and associated with the corrected timestamps to assign a speaker to each word (a minimal sketch of this assignment step is shown below). Finally, a punctuation model is used for realignment to compensate for minor time shifts.
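The word-to-speaker assignment step can be pictured as a simple interval-overlap match between word timestamps and diarized speaker segments. The following is a minimal, self-contained Python sketch of that idea only; the Word and SpeakerSegment classes, the assign_speakers function, and the UNKNOWN fallback label are illustrative assumptions, not the repository's actual data structures.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float

def assign_speakers(words: List[Word], segments: List[SpeakerSegment]) -> List[Tuple[str, Word]]:
    """Label each word with the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for seg in segments:
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = seg.speaker, overlap
        labeled.append((best_speaker, word))
    return labeled

# Toy example: two diarized speaker segments and three aligned words.
segments = [SpeakerSegment("SPEAKER_00", 0.0, 2.0), SpeakerSegment("SPEAKER_01", 2.0, 4.0)]
words = [Word("hello", 0.1, 0.5), Word("there", 0.6, 1.0), Word("hi", 2.1, 2.4)]
for speaker, word in assign_speakers(words, segments):
    print(f"{speaker}: {word.text} ({word.start:.2f}-{word.end:.2f}s)")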
Quick Start & Requirements
pip install -c constraints.txt -r requirements.txt
Linux: sudo apt update && sudo apt install cython3 ffmpeg
macOS: brew install ffmpeg
Windows: choco install ffmpeg, scoop install ffmpeg, or winget install ffmpeg
python diarize.py -a AUDIO_FILE_NAME
diarize_parallel.py can be used instead if at least 10 GB of VRAM is available; it runs NeMo and Whisper in parallel (see the sketch below).
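As a rough illustration of why the parallel variant needs more VRAM, here is a minimal sketch (not the repository's actual code) that runs two stages concurrently with Python's concurrent.futures; transcribe_stage, diarize_stage, and meeting.wav are hypothetical placeholders for the Whisper and NeMo steps and the input file.

from concurrent.futures import ProcessPoolExecutor

def transcribe_stage(audio_path: str) -> str:
    # Hypothetical placeholder for the Whisper transcription step.
    return f"transcript of {audio_path}"

def diarize_stage(audio_path: str) -> str:
    # Hypothetical placeholder for the NeMo VAD + speaker-embedding step.
    return f"speaker segments of {audio_path}"

if __name__ == "__main__":
    audio = "meeting.wav"  # hypothetical input file
    with ProcessPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe_stage, audio)
        diarization_future = pool.submit(diarize_stage, audio)
        # Both stages run at the same time, so both models would hold GPU memory simultaneously.
        print(transcript_future.result())
        print(diarization_future.result())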
Highlighted Details
The pipeline relies on ctc-forced-aligner for timestamp correction and NeMo's MarbleNet for VAD and segmentation.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Overlapping speakers are not yet handled. The project is experimental, and users are encouraged to report any errors they encounter.