SoundMind by xid32

Audio-language reasoning framework

Created 3 months ago
1,092 stars

Top 34.9% on SourcePulse

Project Summary

SoundMind introduces a rule-based reinforcement learning framework that trains audio-language models (ALMs) to perform logical reasoning across audio and text modalities. It targets researchers and developers working on multimodal AI, providing a specialized dataset and training methodology to improve ALMs' bimodal reasoning capabilities.

How It Works

SoundMind uses a rule-based reinforcement learning (RL) algorithm to incentivize logical reasoning in ALMs. Models are trained on the Audio Logical Reasoning (ALR) dataset, which provides chain-of-thought annotations for both audio and text, to develop bimodal reasoning abilities.
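To make the idea concrete, here is a minimal sketch of what a rule-based reward for RL training might look like: one rule scores output format (reasoning and answer wrapped in tags), another scores answer correctness. The tag names, weights, and function names are assumptions for illustration, not SoundMind's actual reward rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def answer_reward(response: str, gold: str) -> float:
    """Reward an exact-match answer extracted from the <answer> tag."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0

def rule_based_reward(response: str, gold: str) -> float:
    # Weighted combination; the weights are illustrative, not the paper's.
    return 0.2 * format_reward(response) + 0.8 * answer_reward(response, gold)
```

Because the reward is computed from fixed rules rather than a learned reward model, it is cheap to evaluate and hard for the policy to game, which is the usual motivation for rule-based RL setups like this.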

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.10, CUDA 12.1+, and cuDNN 9.8.0+. Install vllm, sglang, mcore, and the project's dependencies by running bash scripts/install_vllm_sglang_mcore.sh, then pip install --no-deps -e . in the repository root. Pinned versions of transformers (4.52.3) and qwen-omni-utils are also recommended.
  • Hardware: Recommended 8x NVIDIA H800/H100 80GB GPUs.
  • Dataset: Download the ALR dataset via wget or Hugging Face.
  • Model Checkpoint: Download pre-trained checkpoints via wget.
  • Training/Evaluation: Run bash main_grpo.sh.
  • Documentation: Dataset Link, Checkpoint Link
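Since the ALR dataset pairs audio with chain-of-thought annotations, a single training example might be shaped roughly as follows. The field names and values here are hypothetical, chosen only to illustrate the structure; consult the dataset card for the real schema.

```python
# Hypothetical shape of one ALR example (field names are assumptions,
# not the dataset's actual schema).
example = {
    "audio": "clips/sample_0001.wav",  # path to the audio clip
    "question": "Does the second speaker agree with the first?",
    "chain_of_thought": (
        "The first speaker proposes a plan; the second speaker's tone "
        "is affirmative and they repeat the key phrase."
    ),
    "answer": "Yes",
}

def has_bimodal_annotation(record: dict) -> bool:
    """Check that a record carries both an audio reference and a reasoning trace."""
    return bool(record.get("audio")) and bool(record.get("chain_of_thought"))
```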

Highlighted Details

  • Introduces the Audio Logical Reasoning (ALR) dataset with 6,446 annotated text-audio samples for complex reasoning.
  • Employs a rule-based RL framework for incentivizing logical reasoning in ALMs.
  • Codebase is built upon vllm and sglang for efficient inference and training.
  • Supports data preprocessing for text-only, audio-only, or bimodal inputs.
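The three input modes mentioned above (text-only, audio-only, bimodal) suggest a modality-aware preprocessing step. The sketch below shows one plausible shape for it; the function name and message format are illustrative assumptions, not SoundMind's actual API.

```python
def build_model_input(text=None, audio_path=None):
    """Assemble a user message from whichever modalities are present.

    Accepts text-only, audio-only, or bimodal input; rejects empty input.
    """
    if text is None and audio_path is None:
        raise ValueError("at least one of text or audio_path is required")
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}
```

Keeping the audio segment first mirrors the common convention in audio-language chat templates of presenting the clip before the question, but the ordering here is a design choice, not a requirement taken from the repository.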

Maintenance & Community

The project is associated with authors from various institutions, as indicated by the citation. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The README does not state a license for the codebase or the dataset, and usage or redistribution terms for the released model checkpoint are likewise unspecified.

Limitations & Caveats

The project has significant hardware requirements (multiple high-end GPUs) and pins specific dependency versions, including CUDA 12.1+, cuDNN 9.8.0+, and transformers 4.52.3, which may complicate installation. The absence of explicit licensing information could pose a barrier to commercial use or broader open-source integration.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
