xzf-thu/Audio-Reasoner: a large audio language model for multimodal reasoning
Top 96.7% on SourcePulse
Audio-Reasoner is a large audio language model that performs in-depth, structured chain-of-thought (CoT) reasoning for multimodal audio understanding. It targets researchers and developers in audio AI, offering state-of-the-art performance on audio benchmarks and enabling advanced audio comprehension through its novel training approach.
How It Works
This project implements inference scaling for Audio-Reasoner, a large audio language model built on Qwen2-Audio-Instruct. Its core innovation is training with structured CoT techniques, using the custom-built CoTA dataset of 1.2M reasoning-rich audio captions and QA pairs. This approach enables the model to perform in-depth audio reasoning across four stages (planning, captioning, reasoning, and summarization), leading to enhanced multimodal understanding.
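The four-stage structure can be made concrete with a small parser. This is an illustrative sketch only: the stage tag names below (PLANNING, CAPTION, REASONING, SUMMARY) are assumptions about how a structured CoT trace might be marked up, not the repository's actual output format.

```python
import re

# Hypothetical stage tags; the actual markup emitted by Audio-Reasoner's
# CoTA-trained outputs may differ. This only illustrates the idea of a
# four-stage structured CoT trace.
STAGES = ["PLANNING", "CAPTION", "REASONING", "SUMMARY"]

def parse_cot_trace(text: str) -> dict:
    """Split a structured CoT response into its labeled stages."""
    result = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        if match:
            result[stage.lower()] = match.group(1).strip()
    return result

example = (
    "<PLANNING>Identify the sound sources, then infer the scene.</PLANNING>"
    "<CAPTION>A dog barks over distant traffic noise.</CAPTION>"
    "<REASONING>Barking plus traffic suggests an urban street.</REASONING>"
    "<SUMMARY>The audio was likely recorded on a city street.</SUMMARY>"
)

stages = parse_cot_trace(example)
print(stages["summary"])  # -> The audio was likely recorded on a city street.
```

Separating the stages like this is what lets the model "think before answering": the final summary is conditioned on explicit planning, captioning, and reasoning steps rather than produced in one shot.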
Quick Start & Requirements
Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing dependencies via requirements.txt. Crucially, transformers==4.48.0 must be installed separately due to its impact on model performance. Users need to replace placeholder paths for model checkpoints and test audio files. The project provides links to HuggingFace for model checkpoints and an arXiv paper detailing its methodology.
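A minimal loading sketch of what the inference setup might look like, assuming the checkpoint follows the Qwen2-Audio format on Hugging Face. The helper name and paths here are illustrative; the repository's own inference script is authoritative.

```python
def load_audio_reasoner(checkpoint_path: str):
    """Load an Audio-Reasoner checkpoint via its Qwen2-Audio backbone.

    Hypothetical helper: the real entry point lives in the repository's
    inference code, and the README pins transformers==4.48.0. Imports are
    deferred so this file can be read without the heavy dependencies.
    """
    import torch
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    processor = AutoProcessor.from_pretrained(checkpoint_path)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(
        checkpoint_path,
        torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
        device_map="auto",           # requires the accelerate package
    )
    return processor, model

# Usage (after downloading the checkpoint from Hugging Face and
# substituting your own local path, as the README instructs):
# processor, model = load_audio_reasoner("/path/to/Audio-Reasoner-checkpoint")
```

The placeholder path mirrors the README's instruction to replace checkpoint and test-audio paths before running.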
Maintenance & Community
The project was initiated in March 2025, with key components like checkpoints and the paper released concurrently. A roadmap includes uploading the CoTA dataset to HuggingFace (March 2025) and open-sourcing the data synthesis pipeline and training code (April 2025). Contact is available via email at zhifei001@e.ntu.edu.sg.
Licensing & Compatibility
The provided README does not specify a software license, which is a critical omission for evaluating adoption and compatibility, particularly for commercial or closed-source use.
Limitations & Caveats
Users must manually provide paths for the model checkpoint and test audio files. The specific hardware requirements for running the model are not detailed. The project appears to be newly released (March 2025), suggesting potential for ongoing development and refinement.