SoundMind by xid32

Audio-language reasoning framework

Created 3 months ago
1,092 stars

Top 34.9% on SourcePulse

Project Summary

SoundMind introduces a rule-based reinforcement learning framework that trains audio-language models (ALMs) to perform logical reasoning across audio and text modalities. It targets researchers and developers working on multimodal AI, providing a specialized dataset and training methodology to improve ALMs' bimodal reasoning capabilities.

How It Works

SoundMind uses a rule-based reinforcement learning (RL) algorithm to incentivize logical reasoning in ALMs. Models are trained on the Audio Logical Reasoning (ALR) dataset, which provides chain-of-thought annotations for both audio and text, to develop bimodal reasoning abilities.
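To make the idea concrete, here is a minimal sketch of what a rule-based reward for RL training might look like: one rule scores output format (reasoning and answer wrapped in tags), another scores answer correctness. The tag names, weights, and function names are assumptions for illustration, not SoundMind's actual reward rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap reasoning and answer in the expected tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def answer_reward(response: str, gold: str) -> float:
    """Reward an exact-match answer extracted from the <answer> tag."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == gold.strip().lower() else 0.0

def rule_based_reward(response: str, gold: str) -> float:
    # Weighted combination; the weights are illustrative, not the paper's.
    return 0.2 * format_reward(response) + 0.8 * answer_reward(response, gold)
```

Because the reward is computed from fixed rules rather than a learned reward model, it is cheap to evaluate and hard for the policy to game, which is the usual motivation for rule-based RL setups like this.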

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.10, CUDA 12.1+, and cuDNN 9.8.0+. Install vllm, sglang, mcore, and the project's dependencies by running bash scripts/install_vllm_sglang_mcore.sh, then pip install --no-deps -e . in the repository root. Pinned versions of transformers (4.52.3) and qwen-omni-utils are also recommended.
  • Hardware: Recommended 8x NVIDIA H800/H100 80GB GPUs.
  • Dataset: Download the ALR dataset via wget or Hugging Face.
  • Model Checkpoint: Download pre-trained checkpoints via wget.
  • Training/Evaluation: Run bash main_grpo.sh.
  • Documentation: Dataset Link, Checkpoint Link
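Since the ALR dataset pairs audio with chain-of-thought annotations, a single training example might be shaped roughly as follows. The field names and values here are hypothetical, chosen only to illustrate the structure; consult the dataset card for the real schema.

```python
# Hypothetical shape of one ALR example (field names are assumptions,
# not the dataset's actual schema).
example = {
    "audio": "clips/sample_0001.wav",  # path to the audio clip
    "question": "Does the second speaker agree with the first?",
    "chain_of_thought": (
        "The first speaker proposes a plan; the second speaker's tone "
        "is affirmative and they repeat the key phrase."
    ),
    "answer": "Yes",
}

def has_bimodal_annotation(record: dict) -> bool:
    """Check that a record carries both an audio reference and a reasoning trace."""
    return bool(record.get("audio")) and bool(record.get("chain_of_thought"))
```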

Highlighted Details

  • Introduces the Audio Logical Reasoning (ALR) dataset with 6,446 annotated text-audio samples for complex reasoning.
  • Employs a rule-based RL framework for incentivizing logical reasoning in ALMs.
  • Codebase is built upon vllm and sglang for efficient inference and training.
  • Supports data preprocessing for text-only, audio-only, or bimodal inputs.
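The three input modes mentioned above (text-only, audio-only, bimodal) suggest a modality-aware preprocessing step. The sketch below shows one plausible shape for it; the function name and message format are illustrative assumptions, not SoundMind's actual API.

```python
def build_model_input(text=None, audio_path=None):
    """Assemble a user message from whichever modalities are present.

    Accepts text-only, audio-only, or bimodal input; rejects empty input.
    """
    if text is None and audio_path is None:
        raise ValueError("at least one of text or audio_path is required")
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}
```

Keeping the audio segment first mirrors the common convention in audio-language chat templates of presenting the clip before the question, but the ordering here is a design choice, not a requirement taken from the repository.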

Maintenance & Community

The project is associated with authors from various institutions, as indicated by the citation. No specific community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The README does not state a license for the codebase or the dataset, and usage or redistribution terms for the released model checkpoint are likewise unspecified.

Limitations & Caveats

The project has significant hardware requirements (multiple high-end GPUs) and pins specific dependency versions, including CUDA 12.1+, cuDNN 9.8.0+, and transformers 4.52.3, which may complicate installation. The absence of explicit licensing information could pose a barrier to commercial use or broader open-source integration.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days
