Audio-Reasoner by xzf-thu

Large audio language model for multimodal reasoning

Created 8 months ago
264 stars

Top 96.7% on SourcePulse

Project Summary

Audio-Reasoner is a large audio language model that enables in-depth, structured Chain-of-Thought (COT) reasoning for multimodal audio understanding. It targets researchers and developers in audio AI, offering state-of-the-art performance on audio benchmarks and enabling advanced audio comprehension through its novel training approach.

How It Works

This project implements inference scaling for Audio-Reasoner, a large audio language model built upon Qwen2-Audio-Instruct. Its core innovation lies in training with structured COT techniques, utilizing the custom-built CoTA dataset comprising 1.2M reasoning-rich audio captions and QA pairs. This approach enables the model to perform in-depth audio reasoning across planning, captioning, reasoning, and summarization stages, leading to enhanced multimodal understanding.
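As an illustration of what a staged, structured COT response might look like when consumed programmatically, here is a minimal sketch that splits a tagged response into its stages. The `<PLANNING>`/`<CAPTION>`/`<REASONING>`/`<SUMMARY>` tag names and the example text are assumptions for illustration, not the model's confirmed output format:

```python
import re

# Hypothetical stage tags; the actual Audio-Reasoner output format may differ.
STAGES = ("PLANNING", "CAPTION", "REASONING", "SUMMARY")

def split_cot(response: str) -> dict:
    """Extract each reasoning stage from a tagged model response."""
    stages = {}
    for tag in STAGES:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            stages[tag.lower()] = match.group(1).strip()
    return stages

example = (
    "<PLANNING>Identify the audio type, then describe key events.</PLANNING>"
    "<CAPTION>A dog barks twice over distant traffic noise.</CAPTION>"
    "<REASONING>Two sharp transients match a barking pattern.</REASONING>"
    "<SUMMARY>The clip contains a barking dog near a road.</SUMMARY>"
)
print(split_cot(example)["caption"])  # -> A dog barks twice over distant traffic noise.
```

Separating the stages this way makes it straightforward to show only the summary to end users while logging the full reasoning trace.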

Quick Start & Requirements

Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing dependencies via requirements.txt. Crucially, transformers==4.48.0 must be installed separately due to its impact on model performance. Users need to replace placeholder paths for model checkpoints and test audio files. The project provides links to HuggingFace for model checkpoints and an arXiv paper detailing its methodology.
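The setup described above might look like the following. The repository URL is inferred from the author and project name shown on this page, and the environment name is a placeholder:

```shell
# Clone the repository (URL assumed from the project/author names above)
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner

# Create and activate a Python 3.10 Conda environment (name is a placeholder)
conda create -n audio_reasoner python=3.10 -y
conda activate audio_reasoner

# Install dependencies, then the pinned transformers version separately,
# since the README notes it affects model performance
pip install -r requirements.txt
pip install transformers==4.48.0
```

After installation, remember to edit the inference script to point at your downloaded model checkpoint and test audio file, as noted above.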

Highlighted Details

  • Achieves state-of-the-art results on MMAU-mini (+25.42%) and AIR-Bench-Chat (+14.57%).
  • Leverages the custom-built CoTA dataset, comprising 1.2M reasoning-rich audio captions and QA pairs.
  • Supports diverse audio types, including sound, music, and speech, for comprehensive analysis.
  • Enables structured Chain-of-Thought (COT) reasoning for audio tasks.

Maintenance & Community

The project was initiated in March 2025, with key components like checkpoints and the paper released concurrently. A roadmap includes uploading the CoTA dataset to HuggingFace (March 2025) and open-sourcing the data synthesis pipeline and training code (April 2025). Contact is available via email at zhifei001@e.ntu.edu.sg.

Licensing & Compatibility

The provided README does not specify a software license, which is a critical omission for evaluating adoption and compatibility, particularly for commercial or closed-source use.

Limitations & Caveats

Users must manually provide paths for the model checkpoint and test audio files. The specific hardware requirements for running the model are not detailed. The project appears to be newly released (March 2025), suggesting potential for ongoing development and refinement.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat. Created 2 years ago, updated 1 year ago; 2k stars (top 0.2%). Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.