Step-Audio by stepfun-ai

Speech interaction framework for multilingual conversation and controllable speech synthesis

Created 7 months ago
4,516 stars

Top 10.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Step-Audio is an open-source framework for intelligent speech interaction, designed for researchers and developers working with advanced speech processing. It unifies speech comprehension and generation, supporting multilingual conversations, emotional tones, dialects, and various prosodic styles, aiming to provide a production-ready solution for complex speech tasks.

How It Works

Step-Audio employs a 130B-parameter unified multimodal model for both understanding and generation. It tokenizes audio using a dual-codebook approach (semantic and acoustic) with temporal interleaving. The model is built upon a large language model foundation, enhanced with audio-contextualized pre-training and task-specific fine-tuning. A hybrid decoder combines flow matching with neural vocoding for waveform generation, and a streaming-aware architecture with speculative response generation enables real-time interaction.
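
The temporal interleaving can be pictured as merging the semantic and acoustic token streams along the time axis. Below is a minimal sketch in Python; the 2:3 semantic-to-acoustic ratio and the token names are illustrative assumptions, not Step-Audio's documented scheme.

```python
from itertools import islice

def interleave_tokens(semantic, acoustic, ratio=(2, 3)):
    """Merge two codebook token streams into one temporally interleaved sequence.

    Emits ratio[0] semantic tokens, then ratio[1] acoustic tokens, repeating
    until both streams are exhausted. The 2:3 ratio here is an assumption
    chosen to reflect a lower semantic token rate; the real scheme may differ.
    """
    sem, aco = iter(semantic), iter(acoustic)
    merged = []
    while True:
        chunk_s = list(islice(sem, ratio[0]))
        chunk_a = list(islice(aco, ratio[1]))
        if not chunk_s and not chunk_a:
            break
        merged.extend(chunk_s)
        merged.extend(chunk_a)
    return merged

semantic = [f"S{i}" for i in range(4)]   # tokens from the semantic codebook
acoustic = [f"A{i}" for i in range(6)]   # tokens from the acoustic codebook
print(interleave_tokens(semantic, acoustic))
# ['S0', 'S1', 'A0', 'A1', 'A2', 'S2', 'S3', 'A3', 'A4', 'A5']
```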

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment, install the requirements, and clone the model weights from Hugging Face or ModelScope.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.3 (CUDA 12.1 build), CUDA Toolkit, NVIDIA GPU; a pre-flight check sketch follows this list.
  • Hardware: Minimum VRAM of 1.5 GB for the tokenizer, 8 GB for Step-Audio-TTS-3B, and 265 GB for the Chat model. Recommended: 4x A800/H800 GPUs (80 GB each).
  • Documentation: Official Docs
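
A minimal pre-flight sketch for the prerequisites above, using standard Python and PyTorch introspection; it assumes nothing about Step-Audio itself.

```python
import sys

import torch

# Check the stated prerequisites: Python >= 3.10, PyTorch >= 2.3,
# a CUDA build of PyTorch, and at least one NVIDIA GPU.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"
torch_major_minor = tuple(int(x) for x in torch.__version__.split(".")[:2])
assert torch_major_minor >= (2, 3), f"PyTorch 2.3+ required, found {torch.__version__}"
assert torch.cuda.is_available(), "CUDA-capable NVIDIA GPU required"

# Report per-GPU memory so it can be compared against the VRAM figures above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
```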

Highlighted Details

  • Unified 130B multimodal model for ASR, semantic understanding, dialogue, voice cloning, and TTS.
  • Generative Data Engine for high-quality audio generation, reducing reliance on manual data collection.
  • Granular voice control via instruction-based design for emotions, dialects, and vocal styles; a hypothetical prompt sketch follows this list.
  • Integrated ToolCall mechanism and role-playing enhancements for improved agent performance.
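
To illustrate the instruction-based design, a control interface might prepend natural-language tags to the text to be spoken. The helper and tag format below are hypothetical placeholders, not Step-Audio's actual API; its supported instruction syntax is defined in the official docs.

```python
def build_instruction(text: str, emotion: str | None = None,
                      dialect: str | None = None, style: str | None = None) -> str:
    """Compose a natural-language voice-control instruction for a TTS prompt.

    Hypothetical format: parenthesized control tags prefixed to the text.
    """
    controls = [c for c in (emotion, dialect, style) if c]
    prefix = f"({', '.join(controls)}) " if controls else ""
    return prefix + text

# Render the same sentence as a cheerful, whispered line in Sichuan dialect.
prompt = build_instruction(
    "欢迎使用语音合成系统。",  # "Welcome to the speech synthesis system."
    emotion="happy",
    dialect="Sichuanese",
    style="whisper",
)
print(prompt)  # (happy, Sichuanese, whisper) 欢迎使用语音合成系统。
```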

Maintenance & Community

The most recent release (Feb 17, 2025) includes inference code, model weights, and a technical report. Links to Hugging Face and ModelScope for model downloads are provided.

Licensing & Compatibility

The code is licensed under Apache 2.0. However, use of the model weights is governed by separate licenses for Step-Audio-Chat, Step-Audio-Tokenizer, and Step-Audio-TTS-3B, whose terms are not detailed in the README.

Limitations & Caveats

The 130B Step-Audio-Chat model requires substantial GPU memory (265 GB). vLLM inference for Step-Audio-Chat does not support audio input, and because the model uses a variant attention mechanism, it requires a custom flash-attention library.
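
For context, the 265 GB figure matches a back-of-the-envelope estimate of weight memory alone (bf16, 2 bytes per parameter); KV cache and activations add to this, which is why the recommended 4x 80 GB setup leaves headroom.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (GB) needed just to hold the model weights."""
    return n_params * bytes_per_param / 1e9

# 130B parameters at 2 bytes each (bf16) ~= 260 GB of weights,
# close to the stated 265 GB once buffers/overhead are counted.
# 4x A800/H800 at 80 GB each provide 320 GB in total.
print(f"{weight_memory_gb(130e9):.0f} GB")  # 260 GB
```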

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 68 stars in the last 30 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind), Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), and 2 more.

ChatTTS by 2noise

Top 0.2% on SourcePulse
38k stars
Generative speech model for daily dialogue
Created 1 year ago
Updated 2 months ago