audio-flamingo  by NVIDIA

PyTorch code for an audio-language model research paper

created 1 year ago
676 stars

Top 51.0% on sourcepulse

Project Summary

This repository provides a PyTorch implementation of Audio Flamingo 2, an audio-language model designed for long-audio understanding and expert reasoning. Aimed at researchers and practitioners in audio AI, the compact 3B-parameter model reports state-of-the-art performance on more than 20 benchmarks, outperforming larger proprietary models.

How It Works

Audio Flamingo 2 uses a cross-attention architecture, similar to its predecessors, and can process audio inputs up to 5 minutes long. Cross-attention lets the language model attend directly to audio features, which supports complex reasoning over the audio content. The architecture is derived from Open Flamingo and incorporates efficient attention mechanisms.
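To make the cross-attention idea concrete, here is a minimal PyTorch sketch of a Flamingo-style gated cross-attention block, in which text tokens act as queries over audio features. It is an illustration under stated assumptions, not the repository's implementation: the class name GatedCrossAttention, the dimensions, the head count, and the zero-initialised tanh gate are all chosen for the example.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Hypothetical sketch: text tokens (queries) attend to audio features (keys/values)."""

    def __init__(self, text_dim: int, audio_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(text_dim)
        # kdim/vdim let the audio features have a different width than the text stream.
        self.attn = nn.MultiheadAttention(
            embed_dim=text_dim,
            num_heads=num_heads,
            kdim=audio_dim,
            vdim=audio_dim,
            batch_first=True,
        )
        # Assumption: a tanh gate initialised at zero, so the block starts as an identity map.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text:  (batch, text_len, text_dim)
        # audio: (batch, audio_len, audio_dim)
        attended, _ = self.attn(query=self.norm(text), key=audio, value=audio)
        # Residual connection: audio information is mixed in as the gate opens.
        return text + torch.tanh(self.gate) * attended


# Example shapes only; real feature sizes depend on the audio encoder and language model.
text = torch.randn(2, 5, 32)
audio = torch.randn(2, 50, 16)
block = GatedCrossAttention(text_dim=32, audio_dim=16)
out = block(text, audio)
```

Because the gate starts at zero, the output initially equals the text input. In designs of this kind, that lets a pretrained language model keep its behaviour at the start of training while the audio pathway is learned gradually.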

Quick Start & Requirements

  • Installation: No separate install steps are described; the primary inference code is in inference_HF_pretrained/.
  • Prerequisites: PyTorch and Hugging Face Transformers. Hardware requirements (e.g., GPU, VRAM) are not specified, though a GPU is likely needed for efficient inference.
  • Resources: Pretrained checkpoints are available on HuggingFace.
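Before running the inference code, it can help to confirm that the listed prerequisites are importable. The following is a small sketch using only the standard library; the helper name missing_packages is hypothetical, and "torch" and "transformers" are the usual import names for PyTorch and Hugging Face Transformers. The repository may require additional packages that this page does not list.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: only the two prerequisites named above are checked.
    missing = missing_packages(["torch", "transformers"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        import torch

        # GPU requirements are not documented, so this is informational only.
        print("CUDA available:", torch.cuda.is_available())
```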

Highlighted Details

  • Achieves state-of-the-art performance on over 20 benchmarks, including expert audio reasoning and long-audio understanding.
  • Outperforms larger models like Gemini Flash v2 and Qwen-Audio despite its smaller 3B parameter size.
  • Trained exclusively on public datasets.
  • Introduces two new datasets: AudioSkills for expert reasoning and LongAudio for extended audio comprehension.

Maintenance & Community

The project is maintained by NVIDIA. Key components are based on Open Flamingo and LAION-AI/CLAP. The README does not list any community channels (e.g., Discord, Slack).

Licensing & Compatibility

The code is released under the MIT license. However, the checkpoints are subject to the NVIDIA OneWay Noncommercial License, the Qwen Research license, OpenAI's data terms, and original dataset licenses, restricting commercial use.

Limitations & Caveats

The checkpoints are licensed for non-commercial use only, which rules them out for commercial applications. The combination of NVIDIA and Qwen license terms may also raise compatibility concerns for downstream projects.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 18
  • Star history: 219 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

AudioGPT by AIGC-Audio

0.1%
10k
Audio processing and generation research project
created 2 years ago
updated 1 year ago