Speech interaction framework for multilingual conversation and controllable speech synthesis
Step-Audio is an open-source framework for intelligent speech interaction, designed for researchers and developers working on advanced speech processing. It unifies speech comprehension and generation, supporting multilingual conversation, emotional tones, regional dialects, and a range of prosodic styles, with the goal of providing a production-ready solution for complex speech tasks.
How It Works
Step-Audio employs a 130B-parameter unified multimodal model for both understanding and generation. It tokenizes audio using a dual-codebook approach (semantic and acoustic) with temporal interleaving. The model is built upon a large language model foundation, enhanced with audio-contextualized pre-training and task-specific fine-tuning. A hybrid decoder combines flow matching with neural vocoding for waveform generation, and a streaming-aware architecture with speculative response generation enables real-time interaction.
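Below is a minimal sketch of the dual-codebook interleaving idea described above. The token values, group sizes, and 2:3 semantic-to-acoustic ratio are illustrative assumptions, not the project's exact configuration; see the technical report for the real tokenizer settings.

```python
# Illustrative sketch of dual-codebook audio tokenization with temporal
# interleaving: a coarse, low-rate semantic stream and a finer acoustic
# stream are woven into one sequence the language model can consume.
from typing import List


def interleave_tokens(semantic: List[int], acoustic: List[int],
                      sem_per_group: int = 2, aco_per_group: int = 3) -> List[int]:
    """Merge two token streams by alternating fixed-size groups along time."""
    merged: List[int] = []
    s, a = 0, 0
    while s < len(semantic) or a < len(acoustic):
        merged.extend(semantic[s:s + sem_per_group])
        s += sem_per_group
        merged.extend(acoustic[a:a + aco_per_group])
        a += aco_per_group
    return merged


if __name__ == "__main__":
    # Toy streams: 4 semantic tokens and 6 acoustic tokens (2:3 ratio).
    semantic_tokens = [101, 102, 103, 104]
    acoustic_tokens = [901, 902, 903, 904, 905, 906]
    print(interleave_tokens(semantic_tokens, acoustic_tokens))
    # -> [101, 102, 901, 902, 903, 103, 104, 904, 905, 906]
```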
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Recent releases (Feb 17, 2025) include the inference code, model weights, and a technical report. Download links for the models are provided on Hugging Face and ModelScope.
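As an example, the weights can be fetched programmatically with huggingface_hub. The repository id below follows the project's naming on Hugging Face and is an assumption; verify it against the links in the README before use.

```python
# Sketch: download the Step-Audio-Chat weights from Hugging Face.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-Chat",   # assumed repo id; check the README links
    local_dir="./models/Step-Audio-Chat",   # local download target
)
print(f"Model files downloaded to {local_path}")
```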
Licensing & Compatibility
The code is licensed under Apache 2.0. However, the model weights for Step-Audio-Chat, Step-Audio-Tokenizer, and Step-Audio-TTS-3B are governed by their own licenses, which the README does not spell out.
Limitations & Caveats
The 130B Step-Audio-Chat model requires substantial GPU memory (roughly 265 GB). vLLM inference for Step-Audio-Chat does not support audio input, and the model's variant attention mechanism requires a custom flash-attention library.
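The 265 GB figure is consistent with simply holding 130B parameters in 16-bit precision. A rough back-of-the-envelope check, assuming bf16 weights and 8-way tensor parallelism (both assumptions for illustration):

```python
# Sanity check of the memory requirement: 130B parameters at 2 bytes each
# (bf16/fp16) already occupy ~242 GiB before activations and KV cache.
params = 130e9
bytes_per_param = 2
weight_bytes = params * bytes_per_param
print(f"Weights alone: {weight_bytes / 1024**3:.0f} GiB")            # ~242 GiB
# Spread across 8 GPUs with tensor parallelism, that is ~30 GiB per device,
# before accounting for activations, KV cache, and framework overhead.
print(f"Per GPU (8-way TP): {weight_bytes / 8 / 1024**3:.0f} GiB")   # ~30 GiB
```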