Step-Audio-EditX by stepfun-ai

LLM-driven audio model for expressive editing and TTS

Created 2 months ago
817 stars

Top 43.4% on SourcePulse

View on GitHub
Project Summary

Step-Audio-EditX is a 3B-parameter, LLM-based audio model, post-trained with reinforcement learning, that tackles complex audio editing by enabling precise control over emotion, speaking style, and paralinguistics. It also offers robust zero-shot text-to-speech and supports iterative editing for nuanced audio refinement, targeting researchers and developers who need advanced audio manipulation tools.

How It Works

The architecture comprises a dual-codebook audio tokenizer, an LLM that generates audio token sequences, and a flow-matching decoder that reconstructs waveforms from those tokens. Control over attributes such as emotion and speaking style comes from post-training, leveraging large-margin data in both the SFT and PPO stages, so edits can be applied iteratively for progressively refined outputs.
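
A minimal sketch of that three-stage flow is below. All class names (AudioTokenizer, EditLM, FlowMatchingDecoder) and the hop size are hypothetical placeholders, not the project's actual API; the repo's inference scripts are authoritative.

```python
import numpy as np

class AudioTokenizer:
    """Dual-codebook tokenizer: maps a waveform to two parallel token streams."""
    def encode(self, waveform: np.ndarray) -> tuple:
        # Placeholder: the real tokenizer quantizes audio with two codebooks.
        frames = len(waveform) // 320            # assumed fixed hop size
        return (np.zeros(frames, dtype=np.int64),
                np.zeros(frames, dtype=np.int64))

class EditLM:
    """3B LLM: consumes text plus audio tokens, emits an edited token sequence."""
    def generate(self, text: str, audio_tokens, instruction: str):
        # Placeholder: the real model autoregressively generates a new
        # dual-codebook sequence conditioned on the edit instruction.
        return audio_tokens

class FlowMatchingDecoder:
    """Reconstructs a waveform from the (edited) token sequence."""
    def decode(self, audio_tokens) -> np.ndarray:
        return np.zeros(len(audio_tokens[0]) * 320, dtype=np.float32)

# One round: tokenize -> edit via the LLM -> decode back to audio.
tokenizer, lm, decoder = AudioTokenizer(), EditLM(), FlowMatchingDecoder()
tokens = tokenizer.encode(np.zeros(16000, dtype=np.float32))   # 1 s stand-in clip
edited = lm.generate("Hello there.", tokens, instruction="emotion: happy")
waveform = decoder.decode(edited)
```

Working in this shared token space is what lets a single LLM cover both TTS and editing: both reduce to predicting a dual-codebook token sequence.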

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a Conda environment (python=3.10), install dependencies (pip install -r requirements.txt), and download the model weights from HuggingFace or ModelScope. Docker support is also available.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.1-cu121, CUDA Toolkit, and an NVIDIA GPU with at least 12GB VRAM (16GB recommended). Tested on Linux.
  • Resource Footprint: Requires approximately 12GB of GPU memory for inference. Quantization options (INT8, INT4, AWQ 4-bit) are available for reduced memory usage; a hedged loading sketch follows this list.
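
A minimal sketch of INT8 loading, assuming the checkpoint loads through Hugging Face transformers with bitsandbytes. The model id and the loading path are assumptions; the repo's own inference scripts are authoritative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "stepfun-ai/Step-Audio-EditX"              # assumed HuggingFace repo id

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # INT8 to cut VRAM use

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",            # place layers on the available GPU(s)
    trust_remote_code=True,
)
```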

Highlighted Details

  • Zero-Shot TTS: Supports voice cloning and style transfer for Mandarin, English, Sichuanese, and Cantonese, utilizing dialect tags.
  • Expressive Control: Enables iterative editing across dozens of emotions (e.g., Happy, Angry, Sad) and speaking styles (e.g., Whisper, Serious, Child).
  • Paralinguistic Editing: Offers precise control over 10 paralinguistic features, including breathing, laughter, sighs, and hesitation sounds.
  • Performance: The project reports superior results versus Minimax and Doubao in zero-shot cloning and emotion control, with significant gains achieved through iterative editing (sketched below).
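
The iterative loop amounts to feeding each edited clip back in with a new single-attribute instruction. A minimal sketch, where edit_audio is a hypothetical stand-in for one full tokenize → generate → decode pass:

```python
import numpy as np

def edit_audio(waveform: np.ndarray, instruction: str) -> np.ndarray:
    # Placeholder: a real pass re-synthesizes the clip with one attribute
    # shifted (an emotion, a speaking style, or a paralinguistic tag).
    return waveform

waveform = np.zeros(16000 * 5, dtype=np.float32)   # stand-in 5 s clip @ 16 kHz
for instruction in ("emotion: happy",
                    "style: whisper",
                    "add a laugh after the first sentence"):
    waveform = edit_audio(waveform, instruction)   # one attribute per round
```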

Maintenance & Community

The project actively releases updates and model checkpoints. Feature requests and community feedback are handled via GitHub Discussions. No dedicated community channels (e.g., Discord, Slack) or public roadmap are documented.

Licensing & Compatibility

The code is licensed under the Apache 2.0 License, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The project is under active development; planned features such as polyphone pronunciation control and additional paralinguistic tags are not yet implemented. A strong usage disclaimer warns against misuse, including unauthorized voice cloning, identity impersonation, fraud, and deepfakes, emphasizing ethical AI practices. For optimal performance, keep input audio clips under 30 seconds; a quick duration guard is sketched below.
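
A quick way to check the 30-second recommendation before inference; soundfile is an assumed utility choice here, not a project dependency, and the path is illustrative:

```python
import soundfile as sf

duration = sf.info("input.wav").duration   # clip length in seconds
if duration > 30:
    print(f"Clip is {duration:.1f}s; trim to under 30s for best results.")
```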

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 61 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0% · 4k stars
TTS model for human-like, expressive speech
Created 1 year ago · Updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.4% · 54k stars
Few-shot voice cloning and TTS web UI
Created 2 years ago · Updated 1 week ago