Audio preprocessing for optimized Whisper transcriptions
This repository provides experimental Python code for preprocessing audio files to improve Whisper transcription accuracy and reduce hallucinations. It targets users seeking more reliable transcriptions from noisy or complex audio, offering a suite of audio manipulation techniques.
How It Works
The core approach is a multi-stage audio preprocessing pipeline. It uses Facebook Demucs or Deezer Spleeter for voice extraction, ffmpeg for silence removal and loudness normalization, and Silero VAD (voice activity detection) for removing non-speech segments. The process can also add voice markers, apply speech compression, and try various time-stretching methods to make the audio better suited to Whisper's transcription models.
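The ffmpeg stage of such a pipeline can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names, the -50 dB threshold, and the 0.5 s minimum silence are assumptions chosen for the example. The `silenceremove` and `loudnorm` filters are standard ffmpeg audio filters, and the output is resampled to 16 kHz mono, the format Whisper works with internally.

```python
import subprocess
from typing import List


def build_ffmpeg_cleanup_cmd(
    src: str,
    dst: str,
    silence_db: int = -50,        # assumed threshold, tune per recording
    min_silence_s: float = 0.5,   # assumed minimum silence length to cut
) -> List[str]:
    """Build an ffmpeg command that strips silence and normalizes loudness."""
    filters = ",".join([
        # Trim leading silence, then remove every later silent stretch
        # longer than min_silence_s (stop_periods=-1 means "all of them").
        "silenceremove="
        f"start_periods=1:start_threshold={silence_db}dB:"
        f"stop_periods=-1:stop_duration={min_silence_s}:"
        f"stop_threshold={silence_db}dB",
        # EBU R128 loudness normalization with ffmpeg's default targets.
        "loudnorm",
    ])
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", src,
        "-af", filters,
        "-ar", "16000",      # 16 kHz sample rate
        "-ac", "1",          # mono
        dst,
    ]


def run_cleanup(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cleanup_cmd(src, dst), check=True)
```

Building the command as a list and running it separately keeps the filter string easy to inspect before anything touches the audio. Voice extraction and VAD would run as separate stages before or after this step.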
Quick Start & Requirements
ffmpeg (version >= 4.4 recommended, upgrade instructions provided), openai-whisper, torchaudio, and optionally demucs, spleeter, or faster-whisper.
Highlighted Details
Maintenance & Community
The project is experimental and appears to be a demonstration of the author's capabilities. Contact information for commercial projects is provided via https://cubaix.com.
Licensing & Compatibility
The repository does not explicitly state a license. The inclusion of code from openai/whisper and faster-whisper implies adherence to their respective licenses. Commercial use is not explicitly addressed.
Limitations & Caveats
The code is described as "experimental" and results may vary. Some preprocessing steps, like time stretching, have not shown significant gains for the author. Compatibility with specific ffmpeg versions is crucial, and Google Colab's default version may require upgrading. The effectiveness of prompts is language-dependent and may require tuning.
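Because the ffmpeg version matters, it can help to check it before running the pipeline. The sketch below is an assumption about how one might do that, not something the repository provides: the helper names are hypothetical, and the parser only understands release-style banners such as `ffmpeg version 4.4.2-...`. Git snapshot builds, whose banner does not start with a numeric version, are reported as unknown.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_ffmpeg_version(banner: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the output of `ffmpeg -version`.

    Returns None when the banner has no numeric release version.
    """
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def ffmpeg_is_recent(minimum: Tuple[int, int] = (4, 4)) -> bool:
    """Return True if an ffmpeg at least as new as `minimum` is on PATH."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True
    )
    version = parse_ffmpeg_version(result.stdout)
    # Tuples compare element by element, so (4, 2) < (4, 4) < (5, 0).
    return version is not None and version >= minimum
```

On an environment such as Google Colab, calling `ffmpeg_is_recent()` at startup gives an early signal that the upgrade step is needed.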