scenema-audio by ScenemaAI

Zero-shot voice cloning and expressive speech generation

Created 2 months ago

530 stars

Top 58.9% on SourcePulse

Project Summary

Scenema Audio generates highly expressive and emotionally nuanced speech, overcoming traditional text-to-speech limitations. It allows users to create audio with realistic pacing, breath control, and dynamic emotional arcs, moving beyond mere word pronunciation. The system also offers zero-shot voice cloning: any voice can be replicated from a short reference clip and made to perform emotions it never originally expressed. This makes it well suited to filmmakers, audiobook creators, and content producers.

How It Works

Built on an audio diffusion transformer from LTX 2.3 and leveraging Google's Gemma 3 (12B parameters) for text encoding, Scenema Audio interprets detailed prompts. These include voice characteristics, emotional cues via <action> tags, and environmental context to synthesize speech. A key innovation is its ability to perform emotions and vocal styles not present in the reference voice, with identity transferred via zero-shot cloning from a short audio sample.
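To make the prompt structure concrete, here is a minimal sketch of assembling such a prompt in Python. Only the <action> tag is named in this summary; the <speech> root element, the voice attribute, and the build_prompt helper are hypothetical placeholders chosen for illustration, so the project's README should be consulted for the actual schema.

```python
import xml.etree.ElementTree as ET


def build_prompt(voice, parts):
    """Build an XML prompt string from (kind, text) pairs.

    kind is "action" for an emotional cue or "text" for spoken words.
    NOTE: <speech> and its voice attribute are assumed names, not the
    project's documented schema; only <action> comes from the summary.
    """
    root = ET.Element("speech", {"voice": voice})
    last_action = None
    for kind, text in parts:
        if kind == "action":
            last_action = ET.SubElement(root, "action")
            last_action.text = text
        elif kind == "text":
            if last_action is None:
                # Spoken text before the first cue belongs to the root.
                root.text = (root.text or "") + text
            else:
                # Spoken text after a cue is stored as that cue's tail.
                last_action.tail = (last_action.tail or "") + text
        else:
            raise ValueError(f"unknown part kind: {kind!r}")
    return ET.tostring(root, encoding="unicode")


prompt = build_prompt(
    "warm, gravelly male narrator",
    [
        ("action", "whispers"),
        ("text", " They are already here. "),
        ("action", "sharp inhale"),
        ("text", " Run."),
    ],
)
print(prompt)
```

Building the prompt through an XML library rather than string concatenation means characters such as & or < in the spoken text are escaped automatically, which keeps the prompt well-formed.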

Quick Start & Requirements

Installation is recommended via Docker Compose. Users must set a HuggingFace token with Gemma 3 access (export HF_TOKEN=your_huggingface_token). An NVIDIA GPU with at least 16 GB VRAM is required; 24 GB or 48 GB VRAM is recommended. Initial setup involves downloading ~38 GB of model checkpoints. A Gradio web UI is accessible by setting ENABLE_GRADIO=1.
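The requirements above can be checked before starting the roughly 38 GB download. The following is a minimal sketch, not part of the project: the preflight function is hypothetical, and the VRAM figure is passed in by the caller (for example, read manually from nvidia-smi) rather than detected automatically.

```python
import os

MIN_VRAM_GB = 16          # minimum stated in the summary
RECOMMENDED_VRAM_GB = 24  # lower of the two recommended sizes


def preflight(env, vram_gb):
    """Return a list of issues; an empty list means nothing was flagged.

    env is any mapping of environment variables (e.g. os.environ).
    This helper is illustrative and is not shipped with the project.
    """
    issues = []
    if not env.get("HF_TOKEN"):
        issues.append("error: HF_TOKEN is not set (needs Gemma 3 access)")
    if vram_gb < MIN_VRAM_GB:
        issues.append(
            f"error: {vram_gb} GB VRAM is below the {MIN_VRAM_GB} GB minimum"
        )
    elif vram_gb < RECOMMENDED_VRAM_GB:
        issues.append(
            f"warning: {vram_gb} GB VRAM meets the minimum, but "
            f"{RECOMMENDED_VRAM_GB} GB or more is recommended"
        )
    if env.get("ENABLE_GRADIO") != "1":
        issues.append("note: set ENABLE_GRADIO=1 to enable the Gradio web UI")
    return issues


for issue in preflight(os.environ, vram_gb=16):
    print(issue)
```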

Highlighted Details

Zero-Shot Voice Cloning: Replicates voice identity from 10-20 seconds of reference audio, enabling any voice to perform any emotion.
Expressive Performance Control: Utilizes detailed voice descriptions and <action> tags within an XML prompt format to direct emotional delivery, pacing, and breath control.
Scene-Aware Audio: Generates speech integrated with environmental sounds and ambient audio via shot and background_sfx parameters.
Multilingual Support: Capable of generating speech in major world languages with native-sounding output.
Long-Form Narration: Automatically splits text into ~15-second segments, maintaining voice continuity across chunks.
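The long-form splitting described in the last item can be illustrated with a simple sentence-packing routine. This is a sketch under stated assumptions, not the project's actual splitter: it assumes a fixed speaking rate of 2.5 words per second (about 150 words per minute) and splits only at sentence-ending punctuation. The split_for_narration name and its parameters are hypothetical.

```python
import re


def split_for_narration(text, max_seconds=15.0, words_per_second=2.5):
    """Greedily pack whole sentences into chunks of roughly max_seconds.

    Assumes a constant speaking rate. A single sentence longer than the
    budget becomes its own over-long chunk instead of being cut mid-sentence.
    """
    budget = int(max_seconds * words_per_second)  # word budget per chunk
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n > budget:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


story = (
    "The door opened slowly. Nobody spoke. "
    "Outside, the rain kept falling on the empty street."
)
for chunk in split_for_narration(story, max_seconds=4.0):
    print(repr(chunk))
```

A sentence that exceeds the budget on its own is left whole here, which is one way the suboptimal splitting of very long sentences can arise in any sentence-based scheme.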

Maintenance & Community

The provided README does not detail specific maintenance schedules, notable contributors, or community channels.

Licensing & Compatibility

Model weights are subject to the LTX-2 Community License Agreement. The inference server code is MIT licensed. Gemma 3 is a gated component: using it requires accepting Google's terms of service and supplying a HuggingFace token. Commercial use compatibility should be reviewed against the LTX-2 license terms.

Limitations & Caveats

The model may occasionally mispronounce complex words. Audio generation is segmented into ~15-second clips, potentially leading to suboptimal splitting for very long sentences. Voice cloning prioritizes identity accuracy, which can limit emotional range extremes. Multilingual speech with language switching may result in incorrect phonetic application. High-quality reference audio is crucial for effective voice cloning.
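Because cloning depends on the reference clip, it can help to confirm that a clip at least falls inside the 10-20 second window mentioned under Highlighted Details before submitting it. The sketch below uses only Python's standard wave module and handles uncompressed PCM WAV files only; the helper names are hypothetical, and duration is just one proxy, since it says nothing about noise or recording quality.

```python
import os
import tempfile
import wave


def clip_duration_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_length(path, min_s=10.0, max_s=20.0):
    """True if the clip falls inside the 10-20 second window."""
    return min_s <= clip_duration_seconds(path) <= max_s


# Demo: write 12 seconds of 16 kHz mono silence and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 12)

print(clip_duration_seconds(demo_path), is_usable_length(demo_path))
```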

Explore Similar Projects

ComfyUI-FishAudioS2 by Saganaki22

ComfyUI-F5-TTS by niknah

ComfyUI-VoxCPM by wildminder

ComfyUI_IndexTTS by billwuhao

Voice-Clone-Studio by FranckyB

ComfyUI-Qwen-TTS by flybirdxx

Orpheus-TTS by canopyai

higgs-audio by boson-ai

VALL-E-X by Plachtaa

Zonos by Zyphra

Qwen3-TTS by QwenLM

OpenVoice by myshell-ai