MahmoudAshraf97: ASR pipeline for speaker diarization
This repository provides a pipeline for automatic speech recognition (ASR) with speaker diarization, leveraging OpenAI's Whisper model. It's designed for researchers and developers working with audio data who need to identify who spoke which sentence in a transcription. The primary benefit is an integrated solution for accurate speaker attribution alongside speech-to-text.
How It Works
The pipeline first extracts vocals using Demucs for improved speaker embedding accuracy. It then generates a transcription using Whisper, followed by timestamp correction and alignment with ctc-forced-aligner to mitigate time-shift errors. Voice Activity Detection (VAD) and segmentation are performed using NeMo's MarbleNet to isolate speech segments, excluding silence. Speaker embeddings are extracted using NeMo's TitaNet, and these embeddings are associated with the corrected timestamps to assign speakers to words. Finally, punctuation models are used for realignment to compensate for minor time shifts.
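To make the final assignment step concrete, here is a minimal sketch of attributing aligned words to diarization turns by timestamp overlap. The dictionary shapes and the midpoint heuristic are illustrative assumptions, not the repository's actual code or data structures.

```python
# Hedged sketch of the word-to-speaker assignment step described above.
# Input shapes (word timestamps from the aligner, speaker turns from
# diarization) are assumptions for illustration only.

def assign_speakers(words, speaker_turns):
    """Attach a speaker label to each word by midpoint lookup.

    words: list of dicts like {"word": str, "start": float, "end": float}
    speaker_turns: list of dicts like {"speaker": str, "start": float, "end": float}
    """
    labeled = []
    for w in words:
        midpoint = (w["start"] + w["end"]) / 2
        # Pick the diarization turn whose span covers the word's midpoint;
        # fall back to the nearest turn when the word falls in a gap.
        turn = next(
            (t for t in speaker_turns if t["start"] <= midpoint <= t["end"]),
            min(
                speaker_turns,
                key=lambda t: min(abs(t["start"] - midpoint), abs(t["end"] - midpoint)),
            ),
        )
        labeled.append({**w, "speaker": turn["speaker"]})
    return labeled

if __name__ == "__main__":
    words = [{"word": "hello", "start": 0.0, "end": 0.4},
             {"word": "there", "start": 0.5, "end": 0.9}]
    turns = [{"speaker": "Speaker 0", "start": 0.0, "end": 1.0}]
    print(assign_speakers(words, turns))
```

A midpoint lookup like this is one simple way to resolve words that straddle a turn boundary; the punctuation-based realignment mentioned above then compensates for residual time shifts.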
Quick Start & Requirements
Install the Python dependencies:

```
pip install -c constraints.txt -r requirements.txt
```

Install FFmpeg (and Cython on Linux):

- Debian/Ubuntu: `sudo apt update && sudo apt install cython3 ffmpeg`
- macOS: `brew install ffmpeg`
- Windows: `choco install ffmpeg` or `scoop install ffmpeg` or `winget install ffmpeg`

Run the pipeline on an audio file:

```
python diarize.py -a AUDIO_FILE_NAME
```

diarize_parallel.py can be used if VRAM >= 10 GB to run NeMo and Whisper in parallel.
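For processing many recordings, a small wrapper around the documented diarize.py entry point might look like the sketch below. Only the `-a` flag comes from the quick start above; the folder name and file extension are assumptions.

```python
# Hedged sketch: batch-run diarize.py over a folder of audio files.
import subprocess
from pathlib import Path

AUDIO_DIR = Path("recordings")  # hypothetical input folder

for audio_file in sorted(AUDIO_DIR.glob("*.wav")):
    print(f"Diarizing {audio_file} ...")
    subprocess.run(
        ["python", "diarize.py", "-a", str(audio_file)],
        check=True,  # stop on the first failing file
    )
```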
Highlighted Details

- ctc-forced-aligner for timestamp correction and NeMo's MarbleNet for VAD/segmentation.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Overlapping speakers are not yet handled. The project is experimental, and users are encouraged to report any errors they encounter.