Step-Audio-R1 by stepfun-ai

Audio reasoning model enabling test-time compute scaling

Created 2 months ago
440 stars

Top 67.9% on SourcePulse

View on GitHub
Project Summary

Step-Audio-R1 addresses the "inverted scaling" anomaly in audio language models, where performance degrades with longer reasoning chains. It targets researchers and power users by enabling test-time compute scaling for audio intelligence. The primary benefit is achieving state-of-the-art audio reasoning capabilities, surpassing leading models like Gemini 2.5 Pro, by grounding reasoning in acoustic properties.

How It Works

The model couples a frozen Qwen2-Audio encoder to a Qwen2.5 32B LLM decoder through a specialized adapter. Its core innovation is Modality-Grounded Reasoning Distillation (MGRD), an iterative training process that shifts the model's reasoning from textual surrogates to genuine acoustic properties, letting it "think natively in audio." Grounding reasoning in acoustic nuance resolves the modality mismatch and unlocks performance gains from extended deliberation.
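A minimal sketch of that composition, assuming a simple MLP adapter; the class names, projection layers, and dimensions below are illustrative stand-ins, not taken from the repository:

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Illustrative adapter: projects frozen encoder features into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)

class AudioReasoningLM(nn.Module):
    """Frozen audio encoder -> adapter -> LLM decoder, per the description above."""
    def __init__(self, encoder: nn.Module, adapter: AudioAdapter, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the encoder stays frozen; only adapter/decoder train
        self.adapter = adapter
        self.decoder = decoder

    def forward(self, audio: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            audio_feats = self.encoder(audio)     # acoustic features
        audio_tokens = self.adapter(audio_feats)  # mapped into the LLM's embedding space
        # Prepend audio tokens to the text sequence, the usual audio-LLM recipe.
        return self.decoder(torch.cat([audio_tokens, text_embeds], dim=1))

# Toy wiring with stand-in modules (dimensions are illustrative, not the real ones).
encoder = nn.Linear(80, 1280)     # stands in for the Qwen2-Audio encoder
decoder = nn.Linear(5120, 5120)   # stands in for the Qwen2.5 32B decoder
model = AudioReasoningLM(encoder, AudioAdapter(1280, 5120), decoder)
out = model(torch.randn(1, 50, 80), torch.randn(1, 16, 5120))  # -> (1, 66, 5120)
```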

Quick Start & Requirements

  • Requirements: NVIDIA GPUs with CUDA support, Linux, Python >= 3.10.
  • Installation: Model weights can be cloned via Git LFS or downloaded with the Hugging Face CLI from https://huggingface.co/stepfun-ai/Step-Audio-R1. For deployment, the project recommends its custom Docker image (stepfun2025/vllm:step-audio-2-v20250909) or compiling the customized vLLM backend from source (https://github.com/stepfun-ai/vllm). Example client scripts ship with the repository: git clone https://github.com/stepfun-ai/Step-Audio-R1.git. A minimal download-and-query sketch follows this list.
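
A Python sketch of the download-and-query flow, assuming the vLLM server from the custom image is already running locally at http://localhost:8000/v1 and serves the model under the name Step-Audio-R1; the audio message schema follows vLLM's common audio_url content type and may differ in the customized backend, so treat the repository's client scripts as authoritative:

```python
from huggingface_hub import snapshot_download
from openai import OpenAI

# 1) Fetch the weights (equivalent to the Hugging Face CLI route above).
snapshot_download(repo_id="stepfun-ai/Step-Audio-R1", local_dir="Step-Audio-R1")

# 2) Query the OpenAI-compatible endpoint exposed by the vLLM server.
#    Endpoint, model name, and message schema are assumptions, not repo-confirmed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Step-Audio-R1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": "file:///path/to/clip.wav"}},
            {"type": "text", "text": "What emotion does the speaker convey?"},
        ],
    }],
)
print(response.choices[0].message.content)
```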

Highlighted Details

  • First audio reasoning model to benefit from test-time compute scaling (see the sketch after this list).
  • Achieves performance comparable to or exceeding Gemini 3 and Gemini 2.5 Pro on comprehensive audio benchmarks.
  • Resolves the "inverted scaling" issue plaguing conventional audio models by grounding reasoning in acoustic nuances.
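
To make the first bullet concrete, a hedged sketch of what "benefiting from test-time compute scaling" looks like in practice: sweep the token budget allotted to the reasoning chain and check that answer quality improves rather than degrades. The endpoint and model name are the same assumptions as in the quick-start sketch, and max_tokens is used as a crude proxy for the thinking budget:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

# Sweep the completion budget; an inverted-scaling model would get *worse*
# as the budget grows, while Step-Audio-R1 is reported to improve.
for budget in (256, 1024, 4096):
    resp = client.chat.completions.create(
        model="Step-Audio-R1",
        max_tokens=budget,
        # In practice the prompt would include audio content as in the
        # quick-start sketch; a plain text prompt keeps this sketch short.
        messages=[{"role": "user", "content": "Reason step by step: ..."}],
    )
    print(budget, resp.usage.completion_tokens)
```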

Maintenance & Community

  • Recent releases and technical reports are dated November 2025, indicating active development.
  • No explicit community channels (e.g., Discord, Slack) or roadmap links are provided in the README.
  • A Star History chart is available for tracking community engagement.

Licensing & Compatibility

  • License: No license is specified in the README, a potential adoption blocker, particularly for commercial use.
  • Compatibility: Requires a customized vLLM backend, which may introduce complexities compared to standard vLLM deployments.

Limitations & Caveats

Deployment requires specific hardware (NVIDIA GPUs) and OS (Linux), and involves either custom Docker images or compiling a modified vLLM backend, increasing setup complexity. The project's recent release dates (November 2025) suggest it is cutting-edge but potentially less battle-tested.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
3
Star History
33 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and
1 more.

Qwen-Audio by QwenLM

Top 0.2% on SourcePulse
2k stars
Audio-language model for audio understanding and chat
Created 2 years ago
Updated 1 year ago