scenema-audio  by ScenemaAI

Zero-shot voice cloning and expressive speech generation

Created 3 weeks ago


491 stars

Top 62.4% on SourcePulse

Project Summary

Summary

Scenema Audio generates expressive, emotionally nuanced speech with realistic pacing, breath control, and dynamic emotional arcs, rather than only pronouncing the words. It also offers zero-shot voice cloning: any voice can be replicated from a short reference clip and made to perform emotions the original recording never expressed. The project is aimed at filmmakers, audiobook creators, and content producers.

How It Works

Scenema Audio is built on an audio diffusion transformer from LTX 2.3 and uses Google's Gemma 3 (12B parameters) for text encoding. It synthesizes speech from detailed prompts that include voice characteristics, emotional cues given as <action> tags, and environmental context. A key feature is that it can perform emotions and vocal styles not present in the reference voice, while the voice identity is transferred by zero-shot cloning from a short audio sample.
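
As a rough illustration of the prompt structure described above, the sketch below assembles an XML prompt from a voice description and a list of (action, line) pairs. Only the <action> tag is taken from the summary; the <speech> root, the <voice> element, and the build_prompt helper are hypothetical names chosen for this example, and the project's real schema may differ.

```python
import xml.etree.ElementTree as ET


def build_prompt(voice, beats):
    """Assemble an XML prompt from a voice description and (action, line) beats.

    Only the <action> tag comes from the project summary; <speech> and
    <voice> are hypothetical placeholders, not the project's confirmed schema.
    """
    root = ET.Element("speech")              # hypothetical root element
    voice_el = ET.SubElement(root, "voice")  # hypothetical voice description
    voice_el.text = voice
    for action, line in beats:
        action_el = ET.SubElement(root, "action")
        action_el.text = action              # emotional or delivery cue
        action_el.tail = f" {line} "         # spoken text that follows the cue
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")
```

Calling build_prompt("warm, low female voice", [("whispers", "Are you still there?")]) returns a single XML string in which the cue precedes the line it directs. Building the prompt through an XML library rather than string concatenation keeps special characters in the script from breaking the markup.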

Quick Start & Requirements

Installation via Docker Compose is recommended. Users must set a HuggingFace token with Gemma 3 access (export HF_TOKEN=your_huggingface_token). An NVIDIA GPU with at least 16 GB of VRAM is required; 24 GB or 48 GB is recommended. Initial setup downloads ~38 GB of model checkpoints. Setting ENABLE_GRADIO=1 enables a Gradio web UI.
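
The requirements above can be turned into a small pre-launch check. The sketch below is not part of the project: preflight and gradio_enabled are hypothetical helpers, and the VRAM figure is assumed to be supplied by the caller from whatever GPU query tool is available. Only the HF_TOKEN and ENABLE_GRADIO variable names and the 16 GB and 24 GB thresholds come from the summary.

```python
import os

MIN_VRAM_GB = 16          # minimum stated in the summary
RECOMMENDED_VRAM_GB = 24  # lower of the two recommended sizes


def preflight(env, vram_gb):
    """Return (errors, warnings) for a planned launch; both empty means ready."""
    errors, warnings = [], []
    if not env.get("HF_TOKEN"):
        errors.append("HF_TOKEN is not set (a token with Gemma 3 access is required)")
    if vram_gb < MIN_VRAM_GB:
        errors.append(f"{vram_gb} GB VRAM is below the {MIN_VRAM_GB} GB minimum")
    elif vram_gb < RECOMMENDED_VRAM_GB:
        warnings.append(f"{vram_gb} GB VRAM meets the minimum; {RECOMMENDED_VRAM_GB} GB or more is recommended")
    return errors, warnings


def gradio_enabled(env):
    """True when the environment asks for the Gradio web UI."""
    return env.get("ENABLE_GRADIO") == "1"
```

A typical call would be preflight(os.environ, vram_gb=24), run before starting the containers so that a missing token is caught before the ~38 GB checkpoint download begins.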

Highlighted Details

  • Zero-Shot Voice Cloning: Replicates voice identity from 10-20 seconds of reference audio, enabling any voice to perform any emotion.
  • Expressive Performance Control: Utilizes detailed voice descriptions and <action> tags within an XML prompt format to direct emotional delivery, pacing, and breath control.
  • Scene-Aware Audio: Generates speech integrated with environmental sounds and ambient audio via shot and background_sfx parameters.
  • Multilingual Support: Capable of generating speech in major world languages with native-sounding output.
  • Long-Form Narration: Automatically splits text into ~15-second segments, maintaining voice continuity across chunks.
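
The long-form splitting mentioned in the last item can be approximated as follows. This is a minimal sketch of one plausible approach (greedy packing of whole sentences), not the project's actual algorithm. The speaking rate of 2.5 words per second and the split_for_narration helper are assumptions made for illustration; only the ~15-second target comes from the summary.

```python
import re

WORDS_PER_SECOND = 2.5  # assumed speaking rate (about 150 words per minute)
TARGET_SECONDS = 15     # segment length mentioned in the summary


def split_for_narration(text, target_seconds=TARGET_SECONDS,
                        words_per_second=WORDS_PER_SECOND):
    """Greedily pack whole sentences into segments of roughly target_seconds."""
    budget = int(target_seconds * words_per_second)  # word budget per segment
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue  # empty input yields one empty piece; skip it
        n = len(sentence.split())
        # Start a new segment when this sentence would exceed the budget.
        if current and count + n > budget:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments
```

Because sentences are never cut, a single sentence longer than the budget becomes an over-length segment of its own, which is one way the suboptimal splitting of very long sentences could arise.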

Maintenance & Community

The README does not describe maintenance schedules, notable contributors, or community channels.

Licensing & Compatibility

Model weights are subject to the LTX-2 Community License Agreement; the inference server code is MIT licensed. Gemma 3 is a gated component: using it requires accepting Google's terms of service and supplying a HuggingFace token. Review the LTX-2 license terms before commercial use.

Limitations & Caveats

The model may occasionally mispronounce complex words. Because audio is generated in ~15-second segments, very long sentences may be split at suboptimal points. Voice cloning prioritizes identity accuracy, which can limit the extremes of emotional range. Switching languages within multilingual speech may apply the wrong phonetics. Effective voice cloning depends on high-quality reference audio.
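
Since cloning depends on the reference clip, it can help to check the clip before submitting it. The sketch below uses Python's standard wave module to confirm that a WAV file falls in the 10-20 second range given earlier. It checks length only, not recording quality, and it assumes an uncompressed PCM WAV file; the helper names are hypothetical and not part of the project.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 10, 20  # reference length range from the summary


def clip_duration_seconds(path):
    """Duration of a PCM WAV file, computed as frame count / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path):
    """True when the clip length is within the suggested reference range."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

Other formats such as MP3 or FLAC would need to be converted first or measured with a different library.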

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 9
  • Star History: 491 stars in the last 24 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Michael Han (Cofounder of Unsloth), and 1 more.

Orpheus-TTS by canopyai

0.1%
6k
Open-source TTS for human-sounding speech, built on Llama-3b
Created 1 year ago
Updated 5 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Junyang Lin (Core Maintainer at Alibaba Qwen), and 6 more.

OpenVoice by myshell-ai

0.1%
37k
Audio foundation model for versatile, instant voice cloning
Created 2 years ago
Updated 1 year ago