Audio-Reasoner by xzf-thu

Large audio language model for multimodal reasoning

Created 8 months ago
264 stars

Top 96.7% on SourcePulse

Project Summary

Audio-Reasoner is a large audio language model that enables in-depth, structured Chain-of-Thought (COT) reasoning for multimodal audio understanding. It targets researchers and developers in audio AI, offering state-of-the-art performance on audio benchmarks and enabling advanced audio comprehension through its novel training approach.

How It Works

This project implements inference scaling for Audio-Reasoner, a large audio language model built upon Qwen2-Audio-Instruct. Its core innovation lies in training with structured COT techniques, utilizing the custom-built CoTA dataset comprising 1.2M reasoning-rich audio captions and QA pairs. This approach enables the model to perform in-depth audio reasoning across planning, captioning, reasoning, and summarization stages, leading to enhanced multimodal understanding.
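As an illustration of what a staged, structured COT response might look like when consumed programmatically, here is a minimal sketch that splits a tagged response into its stages. The `<PLANNING>`/`<CAPTION>`/`<REASONING>`/`<SUMMARY>` tag names and the example text are assumptions for illustration, not the model's confirmed output format:

```python
import re

# Hypothetical stage tags; the actual Audio-Reasoner output format may differ.
STAGES = ("PLANNING", "CAPTION", "REASONING", "SUMMARY")

def split_cot(response: str) -> dict:
    """Extract each reasoning stage from a tagged model response."""
    stages = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            stages[tag.lower()] = match.group(1).strip()
    return stages

example = (
    "<PLANNING>Identify the audio type, then describe key events.</PLANNING>"
    "<CAPTION>A dog barks twice over distant traffic noise.</CAPTION>"
    "<REASONING>Two sharp transients match a barking pattern.</REASONING>"
    "<SUMMARY>The clip contains a barking dog near a road.</SUMMARY>"
)
print(split_cot(example)["caption"])  # -> A dog barks twice over distant traffic noise.
```

Separating the stages this way makes it straightforward to show only the summary to end users while logging the full reasoning trace.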

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing dependencies via requirements.txt. Crucially, transformers==4.48.0 must be installed separately due to its impact on model performance. Users need to replace placeholder paths for model checkpoints and test audio files. The project provides links to HuggingFace for model checkpoints and an arXiv paper detailing its methodology.
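The setup described above might look like the following. The repository URL is inferred from the author and project name shown on this page, and the environment name is a placeholder:

```shell
# Clone the repository (URL assumed from the project/author names above)
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner

# Create and activate a Python 3.10 Conda environment (name is a placeholder)
conda create -n audio_reasoner python=3.10 -y
conda activate audio_reasoner

# Install dependencies, then the pinned transformers version separately,
# since the README notes it affects model performance
pip install -r requirements.txt
pip install transformers==4.48.0
```

After installation, remember to edit the inference script to point at your downloaded model checkpoint and test audio file, as noted above.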

Highlighted Details

  • Achieves state-of-the-art results on MMAU-mini (+25.42%) and AIR-Bench-Chat (+14.57%).
  • Leverages the custom-built CoTA dataset, comprising 1.2M reasoning-rich audio captions and QA pairs.
  • Supports diverse audio types, including sound, music, and speech, for comprehensive analysis.
  • Enables structured Chain-of-Thought (COT) reasoning for audio tasks.

Maintenance & Community

The project was initiated in March 2025, with key components like checkpoints and the paper released concurrently. A roadmap includes uploading the CoTA dataset to HuggingFace (March 2025) and open-sourcing the data synthesis pipeline and training code (April 2025). Contact is available via email at zhifei001@e.ntu.edu.sg.

Licensing & Compatibility

The provided README does not specify a software license, which is a critical omission for evaluating adoption and compatibility, particularly for commercial or closed-source use.

Limitations & Caveats

Users must manually provide paths for the model checkpoint and test audio files. The specific hardware requirements for running the model are not detailed. The project appears to be newly released (March 2025), suggesting potential for ongoing development and refinement.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat. Created 2 years ago, updated 1 year ago; 2k stars (top 0.2%). Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.