YuanGongND/ast: purely attention-based audio spectrogram classification
Summary
The YuanGongND/ast repository offers the official PyTorch implementation of the Audio Spectrogram Transformer (AST), a convolution-free, purely attention-based model for audio classification. It demonstrates that a Transformer applied directly to spectrogram patches can match or exceed CNN-based models, achieving state-of-the-art results on benchmarks including AudioSet, ESC-50, and Speech Commands. AST is targeted at researchers and practitioners seeking high accuracy and flexibility in audio analysis.
How It Works
AST splits an audio spectrogram into a sequence of overlapping patches, embeds each patch as a token, and feeds the sequence through a purely attention-based Transformer with no convolutional layers. Because the sequence length simply follows the spectrogram size, the model handles variable-length inputs naturally, and this design set new state-of-the-art results across multiple audio classification tasks.
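To make the patch split concrete: per the AST paper, a 128-mel-bin, 1024-frame spectrogram is cut into 16x16 patches with a stride of 10 in both frequency and time (i.e. an overlap of 6), yielding 1212 tokens. A minimal sketch of that arithmetic (function name and defaults are illustrative, not the repository's API):

```python
def num_patches(input_fdim=128, input_tdim=1024, patch=16, fstride=10, tstride=10):
    """Count patch tokens when a (fdim x tdim) spectrogram is tiled with
    patch x patch windows at the given strides (overlap = patch - stride)."""
    f = (input_fdim - patch) // fstride + 1   # patches along frequency
    t = (input_tdim - patch) // tstride + 1   # patches along time
    return f * t, f, t

n, f, t = num_patches()
# 12 patches along frequency x 101 along time = 1212 tokens (matches the paper)
```

Shrinking `input_tdim` (e.g. for shorter clips) shortens the token sequence proportionally, which is why the same architecture copes with variable-length audio.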
Quick Start & Requirements
Installation involves cloning the repository, setting up a Python virtual environment, and running pip install -r requirements.txt. Key model parameters include label_dim, input_tdim, and model_size. The project provides recipes for ESC-50, Speech Commands, and AudioSet that automate data downloading and training. A Google Colab script allows easy inference and attention visualization without a GPU, and pretrained models are readily available.
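The pretrained models consume log-mel filterbank features from 16 kHz audio, and input_tdim is the number of feature frames. Assuming Kaldi-style fbank defaults (25 ms window, 10 ms shift, as used by torchaudio.compliance.kaldi.fbank), a hedged helper for estimating the frame count of a clip (the function is illustrative, not part of the repository):

```python
def fbank_frames(num_samples, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Estimate the Kaldi-style filterbank frame count (snip_edges behaviour):
    frames start every frame_shift samples and must fit entirely in the signal."""
    frame_len = sample_rate * frame_len_ms // 1000      # 400 samples at 16 kHz
    frame_shift = sample_rate * frame_shift_ms // 1000  # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift

# 10 s of 16 kHz audio -> 998 frames, then padded/trimmed to input_tdim (e.g. 1024)
frames = fbank_frames(10 * 16000)
```

This is why a 10-second AudioSet clip pairs naturally with input_tdim=1024: the ~998 real frames are zero-padded up to the model's expected length.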
Maintenance & Community
Contact is available via email (yuangong@mit.edu) or through GitHub issues. The repository was last updated roughly two years ago and is currently marked inactive, so expect limited ongoing maintenance.
Licensing & Compatibility
The repository's README does not explicitly state a software license. This omission may hinder commercial use or integration into closed-source projects.
Limitations & Caveats
Reproducing AudioSet results requires manual data preparation. A potential bug with newer torchaudio versions affecting SpecAugment is noted. Pretrained models require audio input sampled at 16kHz. Achieving top ensemble results necessitates training and averaging multiple models with varying configurations.
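The top ensemble numbers come from averaging the predictions of several independently trained models. A minimal sketch of that averaging step in plain Python (the logit values and helper name are hypothetical; in practice the inputs would be per-model output tensors over the same label set):

```python
def ensemble_logits(per_model_logits):
    """Average class logits across models trained with different configurations.
    All models must share the same label set and ordering."""
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    return [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]

# Three hypothetical models voting over two classes; the average is ~[0.4, 0.6].
avg = ensemble_logits([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

Averaging raw logits (or probabilities) before the final argmax is the usual way such ensembles are scored, since it lets models with different strides or seeds smooth out each other's errors.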