Step-Audio by stepfun-ai

Speech interaction framework for multilingual conversation and controllable speech synthesis

created 5 months ago
4,427 stars

Top 11.3% on sourcepulse

Project Summary

Step-Audio is an open-source framework for intelligent speech interaction, designed for researchers and developers working with advanced speech processing. It unifies speech comprehension and generation, supporting multilingual conversations, emotional tones, dialects, and various prosodic styles, aiming to provide a production-ready solution for complex speech tasks.

How It Works

Step-Audio employs a 130B-parameter unified multimodal model for both understanding and generation. It tokenizes audio using a dual-codebook approach (semantic and acoustic) with temporal interleaving. The model is built upon a large language model foundation, enhanced with audio-contextualized pre-training and task-specific fine-tuning. A hybrid decoder combines flow matching with neural vocoding for waveform generation, and a streaming-aware architecture with speculative response generation enables real-time interaction.
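
The dual-codebook interleaving can be pictured as merging two token streams produced at different rates into one sequence the language model consumes. The sketch below is illustrative only: the chunk sizes (2 semantic tokens alternating with 3 acoustic tokens) are an assumption for the example, not a confirmed detail of Step-Audio's tokenizer.

```python
# Minimal sketch of dual-codebook temporal interleaving (illustrative only).
# The 2:3 chunk ratio below is an assumption for this example.
from typing import List


def interleave_tokens(
    semantic: List[int],
    acoustic: List[int],
    semantic_chunk: int = 2,
    acoustic_chunk: int = 3,
) -> List[int]:
    """Merge two token streams into one sequence by alternating
    fixed-size chunks, preserving their temporal alignment."""
    merged: List[int] = []
    s = a = 0
    while s < len(semantic) or a < len(acoustic):
        merged.extend(semantic[s : s + semantic_chunk])
        s += semantic_chunk
        merged.extend(acoustic[a : a + acoustic_chunk])
        a += acoustic_chunk
    return merged


# 4 semantic + 6 acoustic tokens -> [s0, s1, a0, a1, a2, s2, s3, a3, a4, a5]
print(interleave_tokens([0, 1, 2, 3], [100, 101, 102, 103, 104, 105]))
```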

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment, install the requirements, and download the model weights from Huggingface or Modelscope (a minimal download sketch follows this list).
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3 (CUDA 12.1 build), the CUDA Toolkit, and an NVIDIA GPU.
  • Hardware: Minimum 1.5GB VRAM for the tokenizer, 8GB for TTS-3B, and 265GB for the 130B Chat model. Recommended: 4x A800/H800 GPUs (80GB each).
  • Documentation: Official Docs
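
As a minimal sketch of the weight-download step using `huggingface_hub`: the repo IDs below follow the stepfun-ai naming on Huggingface but are assumptions here, so verify them against the README before use.

```python
# Fetch the released model weights into local directories.
# Repo IDs are assumed from the stepfun-ai org naming; verify in the README.
from huggingface_hub import snapshot_download

for repo in (
    "stepfun-ai/Step-Audio-Tokenizer",
    "stepfun-ai/Step-Audio-Chat",
    "stepfun-ai/Step-Audio-TTS-3B",
):
    snapshot_download(repo_id=repo, local_dir=repo.split("/")[-1])
```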

Highlighted Details

  • Unified 130B multimodal model for ASR, semantic understanding, dialogue, voice cloning, and TTS.
  • Generative Data Engine for high-quality audio generation, reducing reliance on manual data collection.
  • Granular voice control via instruction-based design for emotions, dialects, and vocal styles (see the usage sketch after this list).
  • Integrated ToolCall mechanism and role-playing enhancements for improved agent performance.
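
The instruction-based voice control is exercised through the TTS entry point. The sketch below is hypothetical: the `StepAudioTTS` and `StepAudioTokenizer` names, their constructor arguments, the speaker name, and the parenthesized instruction syntax are all assumptions for illustration; consult the repository's official examples for the actual API.

```python
# Hypothetical usage sketch -- class names, arguments, and the instruction
# syntax are illustrative assumptions, not the repo's confirmed API.
import torchaudio
from tts import StepAudioTTS              # assumed module/class in the repo
from tokenizer import StepAudioTokenizer  # assumed tokenizer wrapper

encoder = StepAudioTokenizer("Step-Audio-Tokenizer")
tts_engine = StepAudioTTS("Step-Audio-TTS-3B", encoder)

# Instruction-based control: emotion/style hints embedded in the input text.
text = "(Happy) It's a beautiful day, let's take a walk outside."
waveform, sample_rate = tts_engine(text, "default_speaker")  # speaker name assumed
torchaudio.save("output.wav", waveform, sample_rate)
```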

Maintenance & Community

The latest release (Feb 17, 2025) includes the inference code, model weights, and a technical report. Links for model downloads from Huggingface and Modelscope are provided.

Licensing & Compatibility

The code is licensed under Apache 2.0. However, use of the model weights is governed by separate licenses for Step-Audio-Chat, Step-Audio-Tokenizer, and Step-Audio-TTS-3B, which are not explicitly detailed in the README.

Limitations & Caveats

The 130B Step-Audio-Chat model requires substantial GPU memory (265GB). Its vLLM inference path does not support audio input and requires a custom flash-attention library because the model uses a variant attention mechanism.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star history: 199 stars in the last 90 days
