Audio transcription/diarization using Whisper and pyannote-audio
This repository provides a tutorial on combining OpenAI's Whisper for speech-to-text transcription with pyannote.audio for speaker diarization, addressing Whisper's limitation of not identifying speakers in conversations. It is targeted at users who need to analyze multi-speaker audio content, offering a practical solution for segmenting and labeling speech.
How It Works
The approach uses yt-dlp to download and extract audio from videos, then pydub to segment the audio. pyannote.audio runs a pre-trained pipeline to perform speaker diarization, identifying speech segments and assigning speaker labels. Finally, OpenAI's Whisper model transcribes the diarized segments, and the output is combined into an HTML file that annotates the transcriptions with speaker labels and timestamps.
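Below is a minimal Python sketch of that flow. The checkpoint name, the Hugging Face access token, and the file names are illustrative assumptions, not the repository's exact code:

import whisper
from pydub import AudioSegment
from pyannote.audio import Pipeline

# Load the pre-trained diarization pipeline (the gated model requires
# accepting its terms on Hugging Face; the token below is a placeholder).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="YOUR_HF_TOKEN")
model = whisper.load_model("base")

audio = AudioSegment.from_wav("download.wav")
diarization = pipeline("download.wav")

rows = []
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # pydub slices in milliseconds; diarization turn boundaries are in seconds.
    chunk = audio[int(turn.start * 1000):int(turn.end * 1000)]
    chunk.export("segment.wav", format="wav")
    text = model.transcribe("segment.wav")["text"].strip()
    rows.append(f"<p><b>{speaker}</b> [{turn.start:.1f}s-{turn.end:.1f}s]: {text}</p>")

# Combine the labeled, timestamped transcriptions into a simple HTML file.
with open("transcript.html", "w") as f:
    f.write("<html><body>\n" + "\n".join(rows) + "\n</body></html>")

Transcribing each speaker turn separately keeps the speaker labels aligned with the text, at the cost of some per-segment overhead.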
Quick Start & Requirements
Install the Python dependencies:
pip install -U yt-dlp pydub pyannote.audio webvtt-py
plus ffmpeg on your PATH. Then download and extract the audio track from a video:
yt-dlp -xv --ffmpeg-location <ffmpeg_path> --audio-format wav -o download.wav <youtube_url>
Both yt-dlp and pydub rely on ffmpeg for audio extraction and conversion.
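Whisper internally operates on 16 kHz mono audio, so it can help to normalize the download before processing. A minimal pydub sketch, assuming the file names used above (pydub shells out to ffmpeg for decoding):

from pydub import AudioSegment

# Resample the downloaded track to 16 kHz mono WAV before diarization
# and transcription; file names are illustrative.
audio = AudioSegment.from_file("download.wav")
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export("download_16k.wav", format="wav")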
Highlighted Details
Combines OpenAI's Whisper with pyannote.audio for comprehensive speech analysis.
Maintenance & Community
Licensing & Compatibility
pyannote.audio is typically distributed under the MIT license.
Limitations & Caveats
The setup depends on compatible versions of both pyannote.audio and Whisper, requiring a specific execution order.