SoulX-Podcast by Soul-AILab

Realistic long-form podcast generation from text

Created 1 month ago

1,412 stars

Top 28.7% on SourcePulse

Project Summary

SoulX-Podcast is an inference codebase for generating high-fidelity, long-form podcasts from text. It targets users needing realistic multi-turn, multi-speaker dialogic speech synthesis, offering advanced features like cross-dialectal zero-shot voice cloning and paralinguistic controls for enhanced naturalness and personalization.

How It Works

SoulX-Podcast synthesizes multi-turn, multi-speaker dialogue for long-form podcasts. Paralinguistic controls (e.g., laughter, sighs) add realism. Its main novelty is cross-dialectal, zero-shot voice cloning: from prompt audio samples, it generates personalized speech in Mandarin, English, and several Chinese dialects (Sichuanese, Henanese, Cantonese).

Quick Start & Requirements

Installation: Clone the repo (git clone git@github.com:Soul-AILab/SoulX-Podcast.git), create a Conda environment with Python 3.11 (conda create -n soulxpodcast -y python=3.11), activate it (conda activate soulxpodcast), and install requirements (pip install -r requirements.txt).
Model Download: Download the base and dialectal models (1.7B parameters) with huggingface-cli or the Python snapshot_download function from huggingface_hub. Git LFS is required when downloading the models with git clone.
Prerequisites: Conda, Python 3.11, Git LFS.
Usage: Basic inference can be run via bash example/infer_dialogue.sh.
Links: Demo page: https://soul-ailab.github.io/soulx-podcast/. Paper: https://arxiv.org/pdf/2510.23541. Hugging Face models: https://huggingface.co/collections/Soul-AILab/soulx-podcast.
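
The Python download route mentioned under Model Download can be sketched as follows. This is a minimal sketch, assuming the huggingface_hub package is installed. The helper names local_dir_for and download_model, the pretrained_models directory, and the placeholder repository ID are illustrative and not taken from the project; check the Hugging Face collection linked above for the actual repository IDs.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "pretrained_models") -> Path:
    """Map a Hub repo ID such as 'org/name' to a local folder '<root>/name'."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str, root: str = "pretrained_models") -> Path:
    """Download every file of a Hub repository into a local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


# Hypothetical usage; replace the placeholder with a repo ID from the collection:
# download_model("Soul-AILab/<model-repo-name>")
```

Because snapshot_download fetches files over HTTP, this route should not need Git LFS, unlike a git clone of the model repository.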

Highlighted Details

Generates long-form, multi-turn, multi-speaker dialogic speech.
Supports cross-dialectal, zero-shot voice cloning for personalized speech.
Integrates paralinguistic controls like laughter and sighs for enhanced realism.

Maintenance & Community

Paper published: https://arxiv.org/pdf/2510.23541.
Models available on Hugging Face: https://huggingface.co/collections/Soul-AILab/soulx-podcast.
Demo page: https://soul-ailab.github.io/soulx-podcast/.
Contact emails provided for inquiries.
A WeChat group is available for technical discussions.
Future development includes a WebUI, online demo, and Docker support.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: Researchers and developers are free to use the code and model weights. The Apache 2.0 license generally permits commercial use and linking with closed-source projects.

Limitations & Caveats

The project is primarily an inference codebase; example scripts for monologue TTS are pending.
A WebUI, online demo, and Docker support are planned but not yet implemented.
Streaming inference is also a future development goal.
A usage disclaimer strongly advises against misuse for unauthorized voice cloning, impersonation, fraud, or illegal activities, emphasizing ethical standards and responsible AI use.

Health Check

Last Commit: 11 hours ago
Responsiveness: Inactive

Star History

1,468 stars in the last 30 days