ASR pipeline for speaker diarization
This repository provides a pipeline for automatic speech recognition (ASR) with speaker diarization, leveraging OpenAI's Whisper model. It's designed for researchers and developers working with audio data who need to identify who spoke which sentence in a transcription. The primary benefit is an integrated solution for accurate speaker attribution alongside speech-to-text.
How It Works
The pipeline first extracts vocals with Demucs to improve speaker-embedding accuracy. It then transcribes the audio with Whisper, and the resulting timestamps are corrected and aligned with ctc-forced-aligner to mitigate time-shift errors. Voice Activity Detection (VAD) and segmentation are performed with NeMo's MarbleNet to isolate speech segments and exclude silence. Speaker embeddings are extracted with NeMo's TitaNet and associated with the corrected timestamps to assign a speaker to each word (a minimal sketch of this assignment step is shown below). Finally, a punctuation model is used for realignment to compensate for minor time shifts.
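The word-to-speaker assignment step can be pictured as a simple interval-overlap match between word timestamps and diarized speaker segments. The following is a minimal, self-contained Python sketch of that idea only; the Word and SpeakerSegment classes, the assign_speakers function, and the UNKNOWN fallback label are illustrative assumptions, not the repository's actual data structures.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float

def assign_speakers(words: List[Word], segments: List[SpeakerSegment]) -> List[Tuple[str, Word]]:
    """Label each word with the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for seg in segments:
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = seg.speaker, overlap
        labeled.append((best_speaker, word))
    return labeled

# Toy example: two diarized speaker segments and three aligned words.
segments = [SpeakerSegment("SPEAKER_00", 0.0, 2.0), SpeakerSegment("SPEAKER_01", 2.0, 4.0)]
words = [Word("hello", 0.1, 0.5), Word("there", 0.6, 1.0), Word("hi", 2.1, 2.4)]
for speaker, word in assign_speakers(words, segments):
    print(f"{speaker}: {word.text} ({word.start:.2f}-{word.end:.2f}s)")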
Quick Start & Requirements
pip install -c constraints.txt -r requirements.txt
Linux: sudo apt update && sudo apt install cython3 ffmpeg
macOS: brew install ffmpeg
Windows: choco install ffmpeg, scoop install ffmpeg, or winget install ffmpeg
python diarize.py -a AUDIO_FILE_NAME
diarize_parallel.py can be used instead if at least 10 GB of VRAM is available; it runs NeMo and Whisper in parallel (see the sketch below).
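As a rough illustration of why the parallel variant needs more VRAM, here is a minimal sketch (not the repository's actual code) that runs two stages concurrently with Python's concurrent.futures; transcribe_stage, diarize_stage, and meeting.wav are hypothetical placeholders for the Whisper and NeMo steps and the input file.

from concurrent.futures import ProcessPoolExecutor

def transcribe_stage(audio_path: str) -> str:
    # Hypothetical placeholder for the Whisper transcription step.
    return f"transcript of {audio_path}"

def diarize_stage(audio_path: str) -> str:
    # Hypothetical placeholder for the NeMo VAD + speaker-embedding step.
    return f"speaker segments of {audio_path}"

if __name__ == "__main__":
    audio = "meeting.wav"  # hypothetical input file
    with ProcessPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe_stage, audio)
        diarization_future = pool.submit(diarize_stage, audio)
        # Both stages run at the same time, so both models would hold GPU memory simultaneously.
        print(transcript_future.result())
        print(diarization_future.result())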
Highlighted Details
The pipeline relies on ctc-forced-aligner for timestamp correction and NeMo's MarbleNet for VAD and segmentation.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Overlapping speakers are not yet handled. The project is experimental, and users are encouraged to report any errors they encounter.