SoulX-Podcast by Soul-AILab

Realistic long-form podcast generation from text

Created 1 month ago
1,412 stars

Top 28.7% on SourcePulse

Project Summary

SoulX-Podcast is an inference codebase for generating high-fidelity, long-form podcasts from text. It targets users who need realistic multi-turn, multi-speaker dialogue synthesis, and it adds cross-dialectal zero-shot voice cloning and paralinguistic controls for more natural, personalized speech.

How It Works

The model synthesizes multi-turn, multi-speaker dialogue over long-form scripts and accepts paralinguistic controls (e.g., laughter, sighs) to make the speech more realistic. Its main novelty is cross-dialectal, zero-shot voice cloning: using prompt audio samples of a speaker, it generates speech in that voice in Mandarin, English, or several Chinese dialects (Sichuanese, Henanese, Cantonese).
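To make the idea of a multi-speaker script with paralinguistic cues concrete, here is a minimal sketch of one way such an input could be represented before it is handed to a synthesizer. The `Turn` class, the `[S1]`-style speaker labels, and the `<|laughter|>`-style tags are hypothetical names chosen for illustration; they are not confirmed to match the project's actual input format, so check the repository's examples before relying on them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cue-to-tag mapping, for illustration only.
# The real tag syntax (if any) is defined by the project, not by this sketch.
PARALINGUISTIC_TAGS = {
    "laughter": "<|laughter|>",
    "sigh": "<|sigh|>",
}


@dataclass
class Turn:
    """One dialogue turn: who speaks, what they say, and an optional cue."""
    speaker: str                 # hypothetical label such as "S1" or "S2"
    text: str
    cue: Optional[str] = None    # key into PARALINGUISTIC_TAGS, or None


def render_turn(turn: Turn) -> str:
    """Render a turn as '[speaker]text' with the cue tag appended, if any."""
    tag = PARALINGUISTIC_TAGS[turn.cue] if turn.cue else ""
    return f"[{turn.speaker}]{turn.text}{tag}"


def render_dialogue(turns: list) -> str:
    """Join all turns into a single script, one turn per line."""
    return "\n".join(render_turn(t) for t in turns)


script = render_dialogue([
    Turn("S1", "Welcome back to the show."),
    Turn("S2", "Glad to be here.", cue="laughter"),
])
print(script)
# [S1]Welcome back to the show.
# [S2]Glad to be here.<|laughter|>
```

In a zero-shot cloning setup, each speaker label would additionally be paired with a short prompt audio sample for that voice; that pairing is omitted here to keep the sketch small.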

Quick Start & Requirements

  • Installation: Clone the repo (git clone git@github.com:Soul-AILab/SoulX-Podcast.git), create a Conda environment with Python 3.11 (conda create -n soulxpodcast -y python=3.11), activate it (conda activate soulxpodcast), and install requirements (pip install -r requirements.txt).
  • Model Download: Download the base and dialectal models (1.7B parameters) with huggingface-cli or the Python snapshot_download function. Git LFS is required only if you download the weights with git clone.
  • Prerequisites: Conda, Python 3.11, Git LFS.
  • Usage: Basic inference can be run via bash example/infer_dialogue.sh.
  • Links: Demo page: https://soul-ailab.github.io/soulx-podcast/. Paper: https://arxiv.org/pdf/2510.23541. Hugging Face models: https://huggingface.co/collections/Soul-AILab/soulx-podcast.
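The model download step above can be sketched in Python as follows. This is a minimal sketch, not the project's own script: the two repository ids and the `pretrained_models` target directory are assumptions and should be confirmed against the Hugging Face collection linked above. `snapshot_download` is the `huggingface_hub` function the summary refers to; it is imported lazily so the helper functions can be used without the package installed.

```python
import shutil
import sys
from pathlib import Path

# Assumed repository ids; verify them against the Hugging Face collection.
MODEL_REPOS = {
    "base": "Soul-AILab/SoulX-Podcast-1.7B",
    "dialect": "Soul-AILab/SoulX-Podcast-1.7B-dialect",
}


def check_prerequisites() -> list:
    """Return human-readable warnings about the local environment."""
    warnings = []
    if sys.version_info[:2] != (3, 11):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        warnings.append(f"Python 3.11 is expected, found {found}")
    if shutil.which("git-lfs") is None:
        # Only matters when the weights are fetched with git clone.
        warnings.append("git-lfs not found on PATH")
    return warnings


def plan_downloads(root: str = "pretrained_models") -> dict:
    """Map each repo id to the local directory it would be stored in."""
    return {
        repo_id: Path(root) / repo_id.split("/")[-1]
        for repo_id in MODEL_REPOS.values()
    }


def download_models(root: str = "pretrained_models") -> None:
    """Download every planned repo; needs network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # lazy import

    for repo_id, local_dir in plan_downloads(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `check_prerequisites()` first and then `download_models()` from inside the activated Conda environment would fetch both checkpoints; `plan_downloads()` can be used on its own to preview where the files would land.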

Highlighted Details

  • Generates long-form, multi-turn, multi-speaker dialogic speech.
  • Supports cross-dialectal, zero-shot voice cloning for personalized speech.
  • Integrates paralinguistic controls like laughter and sighs for enhanced realism.

Maintenance & Community

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Researchers and developers are free to use the code and model weights. The Apache 2.0 license generally permits commercial use and linking with closed-source projects.

Limitations & Caveats

  • The project is primarily an inference codebase; example scripts for monologue TTS are pending.
  • A WebUI, online demo, and Docker support are planned but not yet implemented.
  • Streaming inference is also a future development goal.
  • A usage disclaimer strongly advises against misuse for unauthorized voice cloning, impersonation, fraud, or illegal activities, emphasizing ethical standards and responsible AI use.

Health Check

  • Last commit: 11 hours ago
  • Responsiveness: Inactive
  • Pull requests (30d): 4
  • Issues (30d): 18
  • Star history: 1,468 stars in the last 30 days
