audio-flamingo by NVIDIA

PyTorch code for an audio-language model research paper

Created 1 year ago

951 stars

Top 38.5% on SourcePulse

1 Expert Loves This Project

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

This repository provides a PyTorch implementation of Audio Flamingo 2, an audio-language model designed for long-audio understanding and expert reasoning. Aimed at researchers and practitioners in audio AI, it reports state-of-the-art performance on over 20 benchmarks with a compact 3B-parameter model, outperforming larger open-source and proprietary models.

How It Works

Audio Flamingo 2 uses a cross-attention architecture, like its predecessors, and can process audio inputs up to 5 minutes long. Text representations attend to audio features through cross-attention, which integrates the audio into language understanding and supports complex reasoning tasks. The architecture is derived from Open Flamingo and incorporates efficient attention mechanisms.
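To make the cross-attention idea concrete, here is a minimal, dependency-free sketch in plain Python. It is not code from the repository: the function names, the toy shapes, and the Flamingo-style tanh gate are assumptions chosen for illustration, and the learned query/key/value projections of a real layer are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, audio_features):
    """Each text vector (query) attends over all audio vectors (keys and values).

    Learned projections are left out to keep the sketch short.
    """
    dim = len(audio_features[0])
    attended = []
    for query in text_states:
        # Scaled dot-product score between this text token and every audio frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in audio_features]
        weights = softmax(scores)
        # Weighted average of the audio frames.
        attended.append([sum(w * value[i] for w, value in zip(weights, audio_features))
                         for i in range(dim)])
    return attended

def gated_cross_attention(text_states, audio_features, gate=0.0):
    """Assumed Flamingo-style residual update: text + tanh(gate) * attention output."""
    attended = cross_attention(text_states, audio_features)
    scale = math.tanh(gate)
    return [[t + scale * a for t, a in zip(text_vec, att_vec)]
            for text_vec, att_vec in zip(text_states, attended)]

# Toy inputs (hypothetical): 2 text tokens, 3 audio frames, feature size 4.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

unchanged = gated_cross_attention(text, audio, gate=0.0)  # tanh(0) = 0: text passes through
fused = gated_cross_attention(text, audio, gate=1.0)      # audio information is mixed in
```

With the gate at zero the text states pass through untouched, which is why Flamingo-style models can add such layers to a pretrained language model without disturbing it at the start of training. Whether Audio Flamingo 2 uses exactly this gating is not stated in this summary.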

Quick Start & Requirements

Installation: The primary inference code is available in inference_HF_pretrained/.
Prerequisites: PyTorch and Hugging Face Transformers. The README does not state hardware requirements (e.g., GPU, VRAM), though a GPU is presumably needed for efficient inference.
Resources: Pretrained checkpoints are available on Hugging Face.
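Because hardware requirements are not documented, a rough lower bound on memory can be worked out from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a measured figure: it assumes exactly 3 billion parameters, counts only the weights (no activations, audio encoder, or framework overhead), and the dtype table and function name are illustrative.

```python
# Bytes needed to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 3_000_000_000  # the "3B" figure quoted in this summary (assumed exact)

estimates = {dtype: weight_memory_gb(NUM_PARAMS, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gigabytes in estimates.items():
    print(f"{dtype:>8}: {gigabytes:.1f} GB")
```

Actual usage will be higher than these figures, particularly for multi-minute audio inputs, so treat them as a floor when sizing a GPU.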

Highlighted Details

Achieves state-of-the-art performance on over 20 benchmarks, including expert audio reasoning and long-audio understanding.
Outperforms larger models such as Gemini Flash v2 and Qwen-Audio despite having only 3B parameters.
Trained exclusively on public datasets.
Introduces two new datasets: AudioSkills for expert reasoning and LongAudio for extended audio comprehension.

Maintenance & Community

The project is maintained by NVIDIA. Key components are based on Open Flamingo and LAION-AI/CLAP. The README does not list community channels (e.g., Discord, Slack).

Licensing & Compatibility

The code is released under the MIT license. The checkpoints, however, are subject to the NVIDIA OneWay Noncommercial License, the Qwen Research license, OpenAI's data terms, and the original dataset licenses, which together restrict commercial use.

Limitations & Caveats

The checkpoints are explicitly licensed for non-commercial use only, which is a significant restriction for commercial applications. Because the checkpoints also depend on NVIDIA's and Qwen's license terms, combining them with differently licensed projects may raise compatibility concerns.

Health Check

Last Commit

3 weeks ago

Responsiveness

1 day

Star History

37 stars in the last 30 days