Step-Audio 2 by stepfun-ai

End-to-end audio understanding and speech conversation model

Created 2 months ago
1,071 stars

Top 35.3% on SourcePulse

View on GitHub
Project Summary

Step-Audio 2 is an end-to-end multi-modal large language model for advanced audio understanding and speech conversation. It targets developers and researchers who need robust audio processing, offering strong performance in ASR and paralinguistic analysis, plus tool-calling integration for reduced hallucinations and flexible response generation.

How It Works

Step-Audio 2 employs a multi-modal LLM architecture designed for comprehensive audio comprehension. It integrates Automatic Speech Recognition (ASR), paralinguistic information processing (gender, age, timbre, emotion), and multimodal Retrieval Augmented Generation (RAG) with tool-calling capabilities. This approach allows it to reason over semantic and non-vocal audio cues, enabling more natural conversations and contextually relevant responses by accessing external knowledge.
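The tool-calling loop described above can be sketched in a few lines: the model emits a structured tool call, the host dispatches it to a registered handler, and the result is fed back to the model to ground its final reply. This is a minimal illustrative sketch; the tool names and call format below are assumptions, not the actual Step-Audio 2 API.

```python
import json

# Hypothetical tool registry; Step-Audio 2's real tools (audio search,
# date/time, weather, web search) would be backed by real services.
TOOLS = {
    "get_weather": lambda args: {"city": args["city"], "forecast": "sunny"},
    "get_datetime": lambda args: {"now": "2025-07-01T12:00:00Z"},
}

def dispatch_tool_call(call_json: str) -> dict:
    """Parse a model-emitted tool call and run the matching handler."""
    call = json.loads(call_json)
    handler = TOOLS.get(call["name"])
    if handler is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return handler(call.get("arguments", {}))

# Example: the model asked for the weather in Beijing.
result = dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Beijing"}}')
print(result)  # → {'city': 'Beijing', 'forecast': 'sunny'}
```

In a RAG setting, the handler's return value would be serialized back into the model's context before it generates the spoken response.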

Quick Start & Requirements

The project provides a technical report and demonstration videos. Specific installation and execution commands are not detailed in the README. Requirements likely include significant computational resources (GPU, CUDA) and potentially large audio datasets for full functionality.

Highlighted Details

  • Achieves state-of-the-art performance on various audio understanding and conversational benchmarks, outperforming models like GPT-4o Audio and Qwen-Omni in several metrics.
  • Demonstrates strong ASR performance across multiple languages (English, Chinese, Cantonese, Japanese, Arabic) and accents.
  • Excels in paralinguistic information understanding, achieving a 76.55 average score on the StepEval-Audio-Paralinguistic benchmark.
  • Features robust tool-calling capabilities, with high precision/recall for audio search, date/time, weather, and web search tasks.
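As a refresher on how the tool-calling precision/recall figures above are typically computed, here is a minimal sketch that scores predicted tool invocations against gold labels. The data is illustrative and not drawn from StepEval-Audio-Toolcall.

```python
from collections import Counter

def precision_recall(predicted: list[str], gold: list[str]) -> tuple[float, float]:
    """Precision/recall over multisets of predicted vs. gold tool invocations."""
    pred_c, gold_c = Counter(predicted), Counter(gold)
    # A prediction counts as a true positive only up to the gold count per tool.
    true_pos = sum(min(pred_c[t], gold_c[t]) for t in pred_c)
    precision = true_pos / max(len(predicted), 1)
    recall = true_pos / max(len(gold), 1)
    return precision, recall

# Example: the model called "weather" twice and "web_search" once,
# while the gold reference expected one call to each.
p, r = precision_recall(["weather", "weather", "web_search"], ["weather", "web_search"])
print(p, r)  # → 0.6666666666666666 1.0
```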

Maintenance & Community

The project is associated with stepfun-ai. Recent updates (July 2025) include the release of demonstration videos, technical reports, and new benchmarks (StepEval-Audio-Paralinguistic, StepEval-Audio-Toolcall). A citation to the technical report is provided.

Licensing & Compatibility

The repository is licensed under the Apache 2.0 License, which permits commercial use and linking with closed-source projects.

Limitations & Caveats

The README does not provide specific installation instructions or quick-start guides, suggesting a focus on research and advanced users. Support for certain languages in ASR is marked as N/A, indicating potential limitations in multilingual coverage.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 45
  • Star History: 741 stars in the last 30 days

Explore Similar Projects

Starred by Junyang Lin (Core Maintainer at Alibaba Qwen), Jinze Bai (Research Scientist at Alibaba Qwen), and 1 more.

Qwen-Audio by QwenLM

Audio-language model for audio understanding and chat
0.4% · 2k stars · Created 1 year ago · Updated 1 year ago