Joint audio tagging and speech recognition model
Whisper-AT enhances OpenAI's Whisper by adding audio event tagging capabilities with minimal computational overhead. It targets users needing both speech transcription and sound event detection, offering a unified solution that maintains Whisper's ASR performance while providing 527-class AudioSet labels.
How It Works
Whisper-AT freezes the original Whisper encoder and trains a novel Time- and Layer-wise Transformer (TL-TR) on top of its intermediate representations. This approach leverages Whisper's robust audio understanding for audio tagging, achieving strong tagging performance at less than 1% additional computational cost compared to running a separate audio tagging model.
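Below is a minimal PyTorch sketch of this idea, intended only to convey the shape of such a head; the class name, layer sizes, and pooling choices are assumptions, not the released TL-TR implementation. It assumes the frozen Whisper encoder's per-layer states are stacked into a (batch, layers, frames, dim) tensor, applies self-attention over time within each layer, pools, applies self-attention across layers, and maps the result to 527 AudioSet logits.

import torch
import torch.nn as nn

class TLTRSketch(nn.Module):
    # Illustrative time- and layer-wise Transformer head (sizes are assumptions).
    def __init__(self, dim=1280, n_classes=527):
        super().__init__()
        time_block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        layer_block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.time_tr = nn.TransformerEncoder(time_block, num_layers=1)    # attention over time frames
        self.layer_tr = nn.TransformerEncoder(layer_block, num_layers=1)  # attention across encoder layers
        self.head = nn.Linear(dim, n_classes)

    def forward(self, reps):
        # reps: frozen Whisper encoder states from every layer, shape (batch, n_layers, n_frames, dim)
        b, l, t, d = reps.shape
        x = self.time_tr(reps.reshape(b * l, t, d))   # time-wise attention within each layer
        x = x.mean(dim=1).reshape(b, l, d)            # mean-pool over time
        x = self.layer_tr(x)                          # layer-wise attention
        return self.head(x.mean(dim=1))               # (batch, 527) AudioSet logits

logits = TLTRSketch()(torch.randn(1, 32, 50, 1280))  # e.g., 32 encoder layers, 50 frames
print(logits.shape)  # torch.Size([1, 527])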
Quick Start & Requirements
Install with pip:

pip install whisper-at

Alternatively, install the dependencies first and then install whisper-at without pulling them in again:

pip install numba numpy torch tqdm more-itertools tiktoken==0.3.3
pip install --no-deps whisper-at

ffmpeg must also be installed and available on the system PATH.
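To confirm the ffmpeg binary is reachable before transcribing, you can run, for example:

ffmpeg -version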
import whisper_at as whisper

# Load a Whisper checkpoint together with its audio tagging (TL-TR) head.
model = whisper.load_model("large-v1")
# Transcribe and tag sound events at a 10-second time resolution.
result = model.transcribe("audio.mp3", at_time_res=10)
print(result["text"])
# Map the tagging output to human-readable AudioSet labels, keeping the top 5 per segment.
audio_tag_result = whisper.parse_at_label(result, top_k=5)
print(audio_tag_result)
Highlighted Details
The examples above use the large-v1 model.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The at_time_res parameter must be an integer multiple of 0.4 seconds.
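If you need a resolution close to some target value, a small hypothetical helper like the one below can snap it to a valid multiple of 0.4 seconds; the function name and rounding policy are illustrative, not part of the library.

# Hypothetical helper: snap a desired tagging resolution to the nearest valid multiple of 0.4 s.
def valid_at_time_res(seconds: float) -> float:
    return max(1, round(seconds / 0.4)) * 0.4

print(valid_at_time_res(9.9))  # 10.0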