ScenemaAI
Zero-shot voice cloning and expressive speech generation
Summary
Scenema Audio generates expressive, emotionally nuanced speech with realistic pacing, breath control, and dynamic emotional arcs, going beyond the plain word pronunciation of traditional text-to-speech. It also offers zero-shot voice cloning: any voice can be replicated from a short reference clip and made to perform emotions it never originally expressed. The tool is aimed at filmmakers, audiobook creators, and content producers.
How It Works
Built on an audio diffusion transformer from LTX 2.3 and leveraging Google's Gemma 3 (12B parameters) for text encoding, Scenema Audio interprets detailed prompts. These include voice characteristics, emotional cues via <action> tags, and environmental context to synthesize speech. A key innovation is its ability to perform emotions and vocal styles not present in the reference voice, with identity transferred via zero-shot cloning from a short audio sample.
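As a rough illustration of the prompt structure described above, the sketch below assembles an XML prompt that combines a voice description, environmental context, and <action> cues. Only the <action> tag comes from the summary; the `prompt`, `voice`, `environment`, and `speech` element names and the `build_prompt` and `action_cues` helpers are hypothetical and may not match the project's real schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from xml.sax.saxutils import escape


def build_prompt(voice: str,
                 lines: List[Tuple[Optional[str], str]],
                 environment: Optional[str] = None) -> str:
    """Assemble an XML prompt string.

    Every element name except <action> is a placeholder chosen for
    illustration, not the project's documented format.
    """
    parts = ["<prompt>", f"  <voice>{escape(voice)}</voice>"]
    if environment:
        parts.append(f"  <environment>{escape(environment)}</environment>")
    parts.append("  <speech>")
    for action, text in lines:
        # An optional emotional cue precedes the words it applies to.
        cue = f"<action>{escape(action)}</action> " if action else ""
        parts.append(f"    {cue}{escape(text)}")
    parts.append("  </speech>")
    parts.append("</prompt>")
    return "\n".join(parts)


def action_cues(prompt: str) -> List[str]:
    """Parse the prompt (which also checks well-formedness) and list its cues."""
    root = ET.fromstring(prompt)
    return [node.text or "" for node in root.iter("action")]


example = build_prompt(
    voice="warm, low female voice",
    environment="quiet room at night",
    lines=[
        ("whispering, holding breath", "Don't move."),
        (None, "They're right outside & listening."),
    ],
)
```

Escaping the free text keeps characters such as `&` from breaking the XML, which matters when prompts are generated from scripts or subtitles.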
Quick Start & Requirements
Installation via Docker Compose is recommended. Users must set a HuggingFace token that has Gemma 3 access (export HF_TOKEN=your_huggingface_token). An NVIDIA GPU with at least 16 GB of VRAM is required; 24 GB or 48 GB is recommended. Initial setup downloads ~38 GB of model checkpoints. A Gradio web UI can be enabled by setting ENABLE_GRADIO=1.
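The requirements above can be checked before starting a large download. The following is a minimal preflight sketch, not part of the project: `vram_tier`, `preflight`, and `gradio_enabled` are hypothetical helpers, and the caller is assumed to supply the GPU memory figure (for example, read from nvidia-smi). Only the HF_TOKEN and ENABLE_GRADIO variable names and the 16 GB and 24 GB thresholds come from the summary.

```python
import os
from typing import List, Mapping

MIN_VRAM_GB = 16          # minimum stated in the requirements
RECOMMENDED_VRAM_GB = 24  # lower of the two recommended sizes


def vram_tier(vram_gb: float) -> str:
    """Classify a GPU's memory against the stated requirements."""
    if vram_gb < MIN_VRAM_GB:
        return "insufficient"
    if vram_gb < RECOMMENDED_VRAM_GB:
        return "minimum"
    return "recommended"


def preflight(env: Mapping[str, str], vram_gb: float) -> List[str]:
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (a token with Gemma 3 access is required)")
    if vram_tier(vram_gb) == "insufficient":
        problems.append(f"{vram_gb} GB VRAM is below the {MIN_VRAM_GB} GB minimum")
    return problems


def gradio_enabled(env: Mapping[str, str]) -> bool:
    """The web UI is switched on with ENABLE_GRADIO=1."""
    return env.get("ENABLE_GRADIO") == "1"


if __name__ == "__main__":
    # 24.0 is a placeholder; substitute the real VRAM size of the target GPU.
    for problem in preflight(os.environ, vram_gb=24.0):
        print("Problem:", problem)
```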
Highlighted Details
<action> tags within an XML prompt format to direct emotional delivery, pacing, and breath control.
shot and background_sfx parameters.
Maintenance & Community
The provided README does not detail specific maintenance schedules, notable contributors, or community channels.
Licensing & Compatibility
Model weights are subject to the LTX-2 Community License Agreement. The inference server code is MIT licensed. Gemma 3 is a gated component: using it requires accepting Google's terms of service and supplying a HuggingFace token. Commercial use should be reviewed against the LTX-2 license terms.
Limitations & Caveats
The model may occasionally mispronounce complex words. Audio generation is segmented into ~15-second clips, potentially leading to suboptimal splitting for very long sentences. Voice cloning prioritizes identity accuracy, which can limit emotional range extremes. Multilingual speech with language switching may result in incorrect phonetic application. High-quality reference audio is crucial for effective voice cloning.
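To show why very long sentences are awkward for ~15-second segmentation, here is a simplified chunking sketch. It is not the project's actual splitter: the speaking rate of 2.5 words per second is an assumption, and `estimate_seconds`, `split_long_sentence`, and `split_into_clips` are hypothetical names. Sentences are packed greedily into clips, and any single sentence that exceeds the budget has to be cut at an arbitrary word boundary.

```python
import re
from typing import List

WORDS_PER_SECOND = 2.5   # assumed speaking rate, not taken from the project
MAX_CLIP_SECONDS = 15.0  # approximate clip length from the summary


def estimate_seconds(text: str) -> float:
    """Rough spoken duration based on word count."""
    return len(text.split()) / WORDS_PER_SECOND


def split_long_sentence(sentence: str, max_seconds: float) -> List[str]:
    """Fallback: cut an oversized sentence at fixed word counts."""
    words = sentence.split()
    per_clip = max(1, int(max_seconds * WORDS_PER_SECOND))
    return [" ".join(words[i:i + per_clip]) for i in range(0, len(words), per_clip)]


def split_into_clips(text: str, max_seconds: float = MAX_CLIP_SECONDS) -> List[str]:
    """Greedily pack sentences into clips that fit the time budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    pieces: List[str] = []
    for sentence in sentences:
        if estimate_seconds(sentence) > max_seconds:
            # This is the suboptimal case: the cut ignores phrasing.
            pieces.extend(split_long_sentence(sentence, max_seconds))
        else:
            pieces.append(sentence)
    clips: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            clips.append(current)
            current = piece
        else:
            current = candidate
    if current:
        clips.append(current)
    return clips
```

In practice, breaking long sentences at commas or clause boundaries in the script before generation is likely to give more natural clip edges than any fixed-length fallback.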