Step-Audio by stepfun-ai

Speech interaction framework for multilingual conversation and controllable speech synthesis

created 5 months ago
4,427 stars

Top 11.3% on sourcepulse

Project Summary

Step-Audio is an open-source framework for intelligent speech interaction, designed for researchers and developers working with advanced speech processing. It unifies speech comprehension and generation, supporting multilingual conversations, emotional tones, dialects, and various prosodic styles, aiming to provide a production-ready solution for complex speech tasks.

How It Works

Step-Audio employs a 130B-parameter unified multimodal model for both understanding and generation. It tokenizes audio using a dual-codebook approach (semantic and acoustic) with temporal interleaving. The model is built upon a large language model foundation, enhanced with audio-contextualized pre-training and task-specific fine-tuning. A hybrid decoder combines flow matching with neural vocoding for waveform generation, and a streaming-aware architecture with speculative response generation enables real-time interaction.
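
The dual-codebook interleaving can be pictured as merging two token streams produced at different rates into one sequence the language model consumes. The sketch below is illustrative only: the chunk sizes (2 semantic tokens alternating with 3 acoustic tokens) are an assumption for the example, not a confirmed detail of Step-Audio's tokenizer.

```python
# Minimal sketch of dual-codebook temporal interleaving (illustrative only).
# The 2:3 chunk ratio below is an assumption for this example.
from typing import List


def interleave_tokens(
    semantic: List[int],
    acoustic: List[int],
    semantic_chunk: int = 2,
    acoustic_chunk: int = 3,
) -> List[int]:
    """Merge two token streams into one sequence by alternating
    fixed-size chunks, preserving their temporal alignment."""
    merged: List[int] = []
    s = a = 0
    while s < len(semantic) or a < len(acoustic):
        merged.extend(semantic[s : s + semantic_chunk])
        s += semantic_chunk
        merged.extend(acoustic[a : a + acoustic_chunk])
        a += acoustic_chunk
    return merged


# 4 semantic + 6 acoustic tokens -> [s0, s1, a0, a1, a2, s2, s3, a3, a4, a5]
print(interleave_tokens([0, 1, 2, 3], [100, 101, 102, 103, 104, 105]))
```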

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment, install the requirements, and download the model weights from Huggingface or Modelscope (a minimal download sketch follows this list).
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3 (CUDA 12.1 build), the CUDA Toolkit, and an NVIDIA GPU.
  • Hardware: Minimum 1.5GB VRAM for the tokenizer, 8GB for TTS-3B, and 265GB for the 130B Chat model. Recommended: 4x A800/H800 GPUs (80GB each).
  • Documentation: Official Docs
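
As a minimal sketch of the weight-download step using `huggingface_hub`: the repo IDs below follow the stepfun-ai naming on Huggingface but are assumptions here, so verify them against the README before use.

```python
# Fetch the released model weights into local directories.
# Repo IDs are assumed from the stepfun-ai org naming; verify in the README.
from huggingface_hub import snapshot_download

for repo in (
    "stepfun-ai/Step-Audio-Tokenizer",
    "stepfun-ai/Step-Audio-Chat",
    "stepfun-ai/Step-Audio-TTS-3B",
):
    snapshot_download(repo_id=repo, local_dir=repo.split("/")[-1])
```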

Highlighted Details

  • Unified 130B multimodal model for ASR, semantic understanding, dialogue, voice cloning, and TTS.
  • Generative Data Engine for high-quality audio generation, reducing reliance on manual data collection.
  • Granular voice control via instruction-based design for emotions, dialects, and vocal styles (see the usage sketch after this list).
  • Integrated ToolCall mechanism and role-playing enhancements for improved agent performance.
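
The instruction-based voice control is exercised through the TTS entry point. The sketch below is hypothetical: the `StepAudioTTS` and `StepAudioTokenizer` names, their constructor arguments, the speaker name, and the parenthesized instruction syntax are all assumptions for illustration; consult the repository's official examples for the actual API.

```python
# Hypothetical usage sketch -- class names, arguments, and the instruction
# syntax are illustrative assumptions, not the repo's confirmed API.
import torchaudio
from tts import StepAudioTTS              # assumed module/class in the repo
from tokenizer import StepAudioTokenizer  # assumed tokenizer wrapper

encoder = StepAudioTokenizer("Step-Audio-Tokenizer")
tts_engine = StepAudioTTS("Step-Audio-TTS-3B", encoder)

# Instruction-based control: emotion/style hints embedded in the input text.
text = "(Happy) It's a beautiful day, let's take a walk outside."
waveform, sample_rate = tts_engine(text, "default_speaker")  # speaker name assumed
torchaudio.save("output.wav", waveform, sample_rate)
```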

Maintenance & Community

The latest release (Feb 17, 2025) includes the inference code, model weights, and a technical report. Links for model downloads from Huggingface and Modelscope are provided.

Licensing & Compatibility

The code is licensed under Apache 2.0. However, use of the model weights is governed by separate licenses for Step-Audio-Chat, Step-Audio-Tokenizer, and Step-Audio-TTS-3B, which are not explicitly detailed in the README.

Limitations & Caveats

The 130B Step-Audio-Chat model requires substantial GPU memory (265GB). Its vLLM inference path does not support audio input and requires a custom flash-attention library because the model uses a variant attention mechanism.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star history: 199 stars in the last 90 days
