Step-Audio-EditX by stepfun-ai

LLM-driven audio model for expressive editing and TTS

Created 2 months ago
817 stars

Top 43.4% on SourcePulse

View on GitHub
Project Summary

Step-Audio-EditX is a 3B-parameter, LLM-based audio model, post-trained with reinforcement learning, that tackles complex audio editing by enabling precise control over emotion, speaking style, and paralinguistics. It also offers robust zero-shot text-to-speech and supports iterative editing for nuanced audio refinement, targeting researchers and developers who need advanced audio manipulation tools.

How It Works

The architecture comprises a dual-codebook audio tokenizer, an LLM that generates audio token sequences, and a flow-matching decoder that reconstructs waveforms from those tokens. Control over attributes such as emotion and speaking style comes from post-training, leveraging large-margin data in both the SFT and PPO stages, so edits can be applied iteratively for progressively refined outputs.
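
A minimal sketch of that three-stage flow is below. All class names (AudioTokenizer, EditLM, FlowMatchingDecoder) and the hop size are hypothetical placeholders, not the project's actual API; the repo's inference scripts are authoritative.

```python
import numpy as np

class AudioTokenizer:
    """Dual-codebook tokenizer: maps a waveform to two parallel token streams."""
    def encode(self, waveform: np.ndarray) -> tuple:
        # Placeholder: the real tokenizer quantizes audio with two codebooks.
        frames = len(waveform) // 320            # assumed fixed hop size
        return (np.zeros(frames, dtype=np.int64),
                np.zeros(frames, dtype=np.int64))

class EditLM:
    """3B LLM: consumes text plus audio tokens, emits an edited token sequence."""
    def generate(self, text: str, audio_tokens, instruction: str):
        # Placeholder: the real model autoregressively generates a new
        # dual-codebook sequence conditioned on the edit instruction.
        return audio_tokens

class FlowMatchingDecoder:
    """Reconstructs a waveform from the (edited) token sequence."""
    def decode(self, audio_tokens) -> np.ndarray:
        return np.zeros(len(audio_tokens[0]) * 320, dtype=np.float32)

# One round: tokenize -> edit via the LLM -> decode back to audio.
tokenizer, lm, decoder = AudioTokenizer(), EditLM(), FlowMatchingDecoder()
tokens = tokenizer.encode(np.zeros(16000, dtype=np.float32))   # 1 s stand-in clip
edited = lm.generate("Hello there.", tokens, instruction="emotion: happy")
waveform = decoder.decode(edited)
```

Working in this shared token space is what lets a single LLM cover both TTS and editing: both reduce to predicting a dual-codebook token sequence.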

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a Conda environment (python=3.10), install dependencies (pip install -r requirements.txt), and download the model weights from HuggingFace or ModelScope. Docker support is also available.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.1-cu121, CUDA Toolkit, and an NVIDIA GPU with at least 12GB VRAM (16GB recommended). Tested on Linux.
  • Resource Footprint: Requires approximately 12GB of GPU memory for inference. Quantization options (INT8, INT4, AWQ 4-bit) are available for reduced memory usage; a hedged loading sketch follows this list.
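
A minimal sketch of INT8 loading, assuming the checkpoint loads through Hugging Face transformers with bitsandbytes. The model id and the loading path are assumptions; the repo's own inference scripts are authoritative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "stepfun-ai/Step-Audio-EditX"              # assumed HuggingFace repo id

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # INT8 to cut VRAM use

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",            # place layers on the available GPU(s)
    trust_remote_code=True,
)
```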

Highlighted Details

  • Zero-Shot TTS: Supports voice cloning and style transfer for Mandarin, English, Sichuanese, and Cantonese, utilizing dialect tags.
  • Expressive Control: Enables iterative editing across dozens of emotions (e.g., Happy, Angry, Sad) and speaking styles (e.g., Whisper, Serious, Child).
  • Paralinguistic Editing: Offers precise control over 10 paralinguistic features, including breathing, laughter, sighs, and hesitation sounds.
  • Performance: The project reports superior results versus Minimax and Doubao in zero-shot cloning and emotion control, with significant gains achieved through iterative editing (sketched below).
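
The iterative loop amounts to feeding each edited clip back in with a new single-attribute instruction. A minimal sketch, where edit_audio is a hypothetical stand-in for one full tokenize → generate → decode pass:

```python
import numpy as np

def edit_audio(waveform: np.ndarray, instruction: str) -> np.ndarray:
    # Placeholder: a real pass re-synthesizes the clip with one attribute
    # shifted (an emotion, a speaking style, or a paralinguistic tag).
    return waveform

waveform = np.zeros(16000 * 5, dtype=np.float32)   # stand-in 5 s clip @ 16 kHz
for instruction in ("emotion: happy",
                    "style: whisper",
                    "add a laugh after the first sentence"):
    waveform = edit_audio(waveform, instruction)   # one attribute per round
```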

Maintenance & Community

The project actively releases updates and model checkpoints. Feature requests and community feedback are handled via GitHub Discussions. No dedicated community channels (e.g., Discord, Slack) or public roadmap are documented.

Licensing & Compatibility

The code is licensed under the Apache 2.0 License, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The project is under active development; planned features such as polyphone pronunciation control and additional paralinguistic tags are not yet implemented. A strong usage disclaimer warns against misuse, including unauthorized voice cloning, identity impersonation, fraud, and deepfakes, emphasizing ethical AI practices. For optimal performance, keep input audio clips under 30 seconds; a quick duration guard is sketched below.
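
A quick way to check the 30-second recommendation before inference; soundfile is an assumed utility choice here, not a project dependency, and the path is illustrative:

```python
import soundfile as sf

duration = sf.info("input.wav").duration   # clip length in seconds
if duration > 30:
    print(f"Clip is {duration:.1f}s; trim to under 30s for best results.")
```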

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 7
  • Star History: 61 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

0.0% · 4k stars
TTS model for human-like, expressive speech
Created 1 year ago · Updated 1 year ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm) and Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

GPT-SoVITS by RVC-Boss

0.4% · 54k stars
Few-shot voice cloning and TTS web UI
Created 2 years ago · Updated 1 week ago