whisper-diarization by MahmoudAshraf97

ASR pipeline for speaker diarization

created 2 years ago
4,777 stars

Top 10.6% on sourcepulse

Project Summary

This repository provides a pipeline for automatic speech recognition (ASR) with speaker diarization, leveraging OpenAI's Whisper model. It's designed for researchers and developers working with audio data who need to identify who spoke which sentence in a transcription. The primary benefit is an integrated solution for accurate speaker attribution alongside speech-to-text.

How It Works

The pipeline first extracts vocals using Demucs for improved speaker embedding accuracy. It then generates a transcription using Whisper, followed by timestamp correction and alignment with ctc-forced-aligner to mitigate time-shift errors. Voice Activity Detection (VAD) and segmentation are performed using NeMo's MarbleNet to isolate speech segments, excluding silence. Speaker embeddings are extracted using NeMo's TitaNet, and these embeddings are associated with the corrected timestamps to assign speakers to words. Finally, punctuation models are used for realignment to compensate for minor time shifts.
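To make the flow concrete, here is a minimal Python sketch of the transcribe, embed, and cluster idea. It is an illustration, not the repository's code: the input file name, the fixed two-speaker count, and the per-segment temporary files are assumptions, and the repository itself uses NeMo's diarization stack rather than the simple agglomerative clustering shown here.

    # Minimal sketch: transcribe with faster-whisper, embed each segment with
    # TitaNet, then cluster the embeddings into speakers. Not the repo's code.
    import numpy as np
    import soundfile as sf
    from faster_whisper import WhisperModel
    from nemo.collections.asr.models import EncDecSpeakerLabelModel
    from sklearn.cluster import AgglomerativeClustering

    # 1. Transcribe with segment timestamps (Whisper via faster-whisper).
    asr = WhisperModel("medium.en", device="cuda", compute_type="float16")
    segments, _ = asr.transcribe("input.wav")  # "input.wav" is a placeholder
    segments = list(segments)

    # 2. Extract a TitaNet speaker embedding for each transcribed segment,
    #    writing each segment to a temporary file for embedding extraction.
    spk_model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")
    audio, sr = sf.read("input.wav")
    embeddings = []
    for i, seg in enumerate(segments):
        sf.write(f"seg_{i}.wav", audio[int(seg.start * sr):int(seg.end * sr)], sr)
        embeddings.append(spk_model.get_embedding(f"seg_{i}.wav").squeeze().cpu().numpy())

    # 3. Cluster embeddings (two speakers assumed) and label each segment.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(np.stack(embeddings))
    for seg, label in zip(segments, labels):
        print(f"[{seg.start:7.2f}s - {seg.end:7.2f}s] Speaker {label}: {seg.text.strip()}")

In the actual pipeline, Demucs runs before transcription and the punctuation-based realignment runs after speaker assignment, so the sketch above corresponds only to the middle stages.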

Quick Start & Requirements

  • Install: pip install -c constraints.txt -r requirements.txt
  • Prerequisites: Python >= 3.10, FFmpeg, Cython.
    • Ubuntu/Debian: sudo apt update && sudo apt install cython3 ffmpeg
    • macOS: brew install ffmpeg
    • Windows: choco install ffmpeg, scoop install ffmpeg, or winget install ffmpeg
  • Usage: python diarize.py -a AUDIO_FILE_NAME
  • Parallel Processing: diarize_parallel.py runs NeMo and Whisper in parallel, which can speed up execution on GPUs with at least 10 GB of VRAM (see the combined example below).
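Putting the steps together, a typical first run looks like this (meeting.wav is a placeholder for your own audio file):

    git clone https://github.com/MahmoudAshraf97/whisper-diarization.git
    cd whisper-diarization
    pip install -c constraints.txt -r requirements.txt
    python diarize.py -a meeting.wav           # standard pipeline
    python diarize_parallel.py -a meeting.wav  # NeMo and Whisper in parallel (>= 10 GB VRAM)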

Highlighted Details

  • Integrates Whisper ASR with Voice Activity Detection (VAD) and Speaker Embedding.
  • Utilizes ctc-forced-aligner for timestamp correction and NeMo's MarbleNet for VAD/segmentation.
  • Employs NeMo's TitaNet for speaker embedding extraction.
  • Supports parallel processing for potentially faster execution on high-VRAM systems.

Maintenance & Community

  • The project builds on OpenAI's Whisper, Faster Whisper, NVIDIA NeMo, and Facebook's Demucs.
  • Citation details are provided in BibTeX format.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

Overlapping speech (multiple simultaneous speakers) is not yet handled. The project is experimental, and users are encouraged to report any errors they encounter.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 2
  • Star History: 352 stars in the last 90 days
