ast by YuanGongND

Purely attention-based audio spectrogram classification

Created 4 years ago

1,407 stars

Top 28.6% on SourcePulse

1 Expert Loves This Project

Chenlin Meng

Cofounder of Pika

Project Summary

The YuanGongND/ast repository is the official PyTorch implementation of the Audio Spectrogram Transformer (AST), a convolution-free, purely attention-based model for audio classification. At release it reported state-of-the-art results on the AudioSet, ESC-50, and Speech Commands benchmarks. It is aimed at researchers and practitioners who need a high-accuracy audio classifier that can be adapted to different tasks and input lengths.

How It Works

AST splits an audio spectrogram into a sequence of patches and feeds that sequence to a Transformer encoder built on attention alone, with no convolutional layers. Because the model operates on a patch sequence rather than a fixed-size image, it can handle inputs of different lengths, and the same architecture is used for all three benchmarks (AudioSet, ESC-50, and Speech Commands).
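The relationship between input length and patch count can be sketched in a few lines. This is a minimal illustration, not code from the repository: the function name patch_grid is hypothetical, and the defaults (128 frequency bins, 1024 time frames, 16x16 patches, stride 10 in both dimensions) are assumptions taken from the AST paper's default configuration rather than from this summary.

```python
def patch_grid(input_fdim=128, input_tdim=1024, patch=16, fstride=10, tstride=10):
    """Return (freq_patches, time_patches, total) for a strided patch split.

    Hypothetical helper: all defaults are assumed values, not read from the repo.
    """
    if input_fdim < patch or input_tdim < patch:
        raise ValueError("spectrogram is smaller than a single patch")
    # Standard sliding-window count: one patch, plus one per full stride that fits.
    freq_patches = (input_fdim - patch) // fstride + 1
    time_patches = (input_tdim - patch) // tstride + 1
    return freq_patches, time_patches, freq_patches * time_patches


print(patch_grid())                        # (12, 101, 1212)
print(patch_grid(input_tdim=512))          # (12, 50, 600)
print(patch_grid(fstride=16, tstride=16))  # (8, 64, 512)
```

A shorter clip simply yields a shorter patch sequence, which is why the number of time frames has to be supplied when the model is built.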

Quick Start & Requirements

To install, clone the repository, create a Python virtual environment, and run pip install -r requirements.txt. The model's key parameters are label_dim (the number of output classes), input_tdim (the number of time frames in the input spectrogram), and model_size. The project ships recipes for ESC-50, Speech Commands, and AudioSet; the ESC-50 and Speech Commands recipes download the data and run training automatically, while AudioSet data must be prepared manually. A Google Colab script supports inference and attention visualization without a GPU. Pretrained models are available for download.
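As a rough sketch of how the three parameters above fit together, the snippet below assembles constructor arguments for a task. Several things here are assumptions rather than facts from this summary: the helper names ast_kwargs and build_model are hypothetical, the list of model_size values and the 10 ms frame shift reflect my reading of the repository, and the import line assumes the cloned repo's src directory is on sys.path. The recipes may also pad or truncate to a fixed frame count that differs from the value computed here.

```python
# Assumed set of accepted model_size values; check the repo before relying on it.
VALID_SIZES = ("tiny224", "small224", "base224", "base384")


def ast_kwargs(num_classes, clip_seconds, model_size="base384", frame_shift_ms=10):
    """Build a kwargs dict for the model (hypothetical helper)."""
    if model_size not in VALID_SIZES:
        raise ValueError(f"model_size must be one of {VALID_SIZES}")
    # One spectrogram frame per frame shift, so 5 s at 10 ms gives 500 frames.
    input_tdim = int(round(clip_seconds * 1000 / frame_shift_ms))
    return {"label_dim": num_classes, "input_tdim": input_tdim, "model_size": model_size}


def build_model(kwargs):
    """Instantiate the model; needs the cloned repo and its requirements installed."""
    from models import ASTModel  # assumption: run from, or with sys.path set to, the repo's src directory
    return ASTModel(**kwargs)


print(ast_kwargs(num_classes=50, clip_seconds=5.0))
# {'label_dim': 50, 'input_tdim': 500, 'model_size': 'base384'}
```

Keeping the import inside build_model lets the argument logic be checked without PyTorch installed; only the final instantiation depends on the repository.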

Highlighted Details

Reports state-of-the-art results at release: 0.485 mAP on AudioSet (the ensemble figure; see Limitations & Caveats), 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
Provides pretrained models trained on Full AudioSet and Speechcommands-V2-35.
Includes a Google Colab script for inference with attention visualization.
Features the PSLA training pipeline for efficient audio classification.

Maintenance & Community

Contact is available via email (yuangong@mit.edu) or GitHub issues. Recent news updates indicate ongoing interest and potential maintenance activity.

Licensing & Compatibility

The repository's README does not explicitly state a software license. This omission may hinder commercial use or integration into closed-source projects.

Limitations & Caveats

Reproducing the AudioSet results requires manual data preparation. A potential bug affecting SpecAugment with newer torchaudio versions is noted. The pretrained models expect audio sampled at 16 kHz. The top ensemble results require training several models with different configurations and averaging their predictions.
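Because the pretrained models expect 16 kHz input, it can help to check a file's sample rate before running inference. The sketch below uses only Python's standard wave module and works for uncompressed WAV files; the function names are hypothetical and nothing here comes from the repository. It only detects a mismatch; resampling would need a separate audio tool.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def is_expected_rate(path, expected_hz=16000):
    """True if the file already matches the rate the pretrained models expect."""
    return wav_sample_rate(path) == expected_hz


def write_silence(path, rate_hz, seconds=1):
    """Write a mono 16-bit silent WAV, used here only to demonstrate the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * (rate_hz * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ok = os.path.join(tmp, "ok.wav")
        bad = os.path.join(tmp, "bad.wav")
        write_silence(ok, 16000)
        write_silence(bad, 44100)
        print(is_expected_rate(ok))   # True
        print(is_expected_rate(bad))  # False
```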

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

21 stars in the last 30 days