ast by YuanGongND

Purely attention-based audio spectrogram classification

Created 4 years ago
1,365 stars

Top 29.6% on SourcePulse

Project Summary

Summary

The YuanGongND/ast repository is the official PyTorch implementation of the Audio Spectrogram Transformer (AST), a convolution-free, purely attention-based model for audio classification. It reports state-of-the-art results on AudioSet, ESC-50, and Speech Commands, and is aimed at researchers and practitioners who need high accuracy and flexibility in audio analysis.

How It Works

AST splits the input spectrogram into a sequence of patches and passes them to a Transformer that relies on attention alone, with no convolutional layers. Because the model operates on a patch sequence, it can handle variable-length inputs, and this design is the basis for the benchmark results the project reports.
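To make the patch step concrete, here is a minimal sketch that counts how many patches a sliding window produces over a spectrogram. The function name patch_grid and the default values (128 frequency bins, 1024 time frames, 16x16 patches, stride 10 in both directions) are assumptions chosen for illustration; they are not stated in this summary and should be checked against the repository.

```python
def patch_grid(input_fdim=128, input_tdim=1024, patch=16, fstride=10, tstride=10):
    """Count patches from sliding a square window over a spectrogram (no padding).

    All defaults are illustrative assumptions, not values taken from this summary.
    Returns (patches along frequency, patches along time, total patches).
    """
    if input_fdim < patch or input_tdim < patch:
        raise ValueError("spectrogram is smaller than a single patch")
    freq_patches = (input_fdim - patch) // fstride + 1
    time_patches = (input_tdim - patch) // tstride + 1
    return freq_patches, time_patches, freq_patches * time_patches


print(patch_grid())                # (12, 101, 1212)
print(patch_grid(input_tdim=512))  # (12, 50, 600)
```

A shorter input only changes the number of patches along the time axis, which is one way to see why a patch-sequence model can accept inputs of different lengths.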

Quick Start & Requirements

To install, clone the repository, create a Python virtual environment, and run pip install -r requirements.txt. The main model parameters are label_dim (number of output classes), input_tdim (number of time frames in the input spectrogram), and model_size. The project provides recipes for ESC-50, Speech Commands, and AudioSet; the ESC-50 and Speech Commands recipes automate data download and training, while AudioSet requires manual data preparation. A Google Colab script supports inference and attention visualization without a GPU. Pretrained models are available.
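As a rough illustration of how label_dim, input_tdim, and model_size fit together, the sketch below collects them into one set of arguments per recipe. The helper ast_kwargs, the RECIPES table, the class and frame counts, and the model_size strings are all assumptions made for this example; confirm them against the repository's recipes before relying on them.

```python
# Hypothetical per-recipe settings. The numbers are illustrative assumptions,
# not values read from this summary.
RECIPES = {
    "audioset": {"label_dim": 527, "input_tdim": 1024},
    "esc50": {"label_dim": 50, "input_tdim": 512},
    "speechcommands": {"label_dim": 35, "input_tdim": 128},
}

# Assumed set of accepted model sizes.
MODEL_SIZES = ("tiny224", "small224", "base224", "base384")


def ast_kwargs(recipe, model_size="base384"):
    """Return a dict of model arguments for a named recipe (illustrative helper)."""
    if recipe not in RECIPES:
        raise KeyError(f"unknown recipe: {recipe!r}")
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model_size: {model_size!r}")
    return {**RECIPES[recipe], "model_size": model_size}


print(ast_kwargs("esc50"))
```

A dictionary like this would typically be unpacked into the model constructor; the constructor's exact signature should be taken from the repository source, not from this sketch.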

Highlighted Details

  • Reports state-of-the-art results at the time of publication: 0.485 mAP on AudioSet (ensemble), 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
  • Provides pretrained models trained on full AudioSet and Speech Commands V2-35.
  • Includes a Google Colab script for inference with attention visualization.
  • Features the PSLA training pipeline for efficient audio classification.

Maintenance & Community

The author can be reached by email (yuangong@mit.edu) or through GitHub issues. The README carries news updates, which suggests continued interest, although the repository shows no commits in the last two years.

Licensing & Compatibility

The repository's README does not explicitly state a software license. This omission may hinder commercial use or integration into closed-source projects.

Limitations & Caveats

Reproducing the AudioSet results requires manual data preparation. A possible SpecAugment bug with newer torchaudio versions is noted. The pretrained models expect audio sampled at 16 kHz. Reaching the top ensemble results requires training multiple models with different configurations and averaging them.
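The averaging step for an ensemble can be sketched as follows, assuming that "averaging" means taking the element-wise mean of each model's output scores for the same clips. That interpretation, the function name average_predictions, and the toy scores are assumptions for illustration; the repository may combine models differently.

```python
def average_predictions(per_model_scores):
    """Average class scores element-wise across models.

    per_model_scores: one entry per model, each a list of clips,
    each clip a list of per-class scores. Illustrative helper only.
    """
    if not per_model_scores:
        raise ValueError("need at least one model's scores")
    first = per_model_scores[0]
    for scores in per_model_scores[1:]:
        same_clips = len(scores) == len(first)
        if not same_clips or any(len(a) != len(b) for a, b in zip(scores, first)):
            raise ValueError("all models must score the same clips and classes")
    n_models = len(per_model_scores)
    return [
        [sum(model[i][j] for model in per_model_scores) / n_models
         for j in range(len(first[i]))]
        for i in range(len(first))
    ]


# Toy scores for two models, two clips, two classes.
model_a = [[0.75, 0.25], [0.5, 0.5]]
model_b = [[0.25, 0.75], [1.0, 0.0]]
print(average_predictions([model_a, model_b]))  # [[0.5, 0.5], [0.75, 0.25]]
```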

Health Check

Last Commit: 2 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1

Star History

23 stars in the last 30 days
