Real-time conversational speech synthesis and voice cloning
Top 95.9% on SourcePulse
Marvis-TTS is a real-time conversational speech model for rapid voice cloning and streaming text-to-speech synthesis. It targets high-quality, efficient speech generation on consumer hardware such as Apple Silicon Macs. The primary benefit is natural, real-time voice cloning from minimal reference audio, with fully on-device deployment.
How It Works
Marvis is built on the Sesame CSM-1B multimodal transformer architecture, operating directly on Residual Vector Quantization (RVQ) tokens via Kyutai's Mimi codec. It employs a dual-transformer design: a 250M-parameter multimodal backbone for semantic understanding and a 60M-parameter audio decoder for speech reconstruction. This approach allows end-to-end training, low-latency generation, and contextual processing of entire text sequences, avoiding chunking artifacts for more natural intonation and flow.
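To make the token flow concrete, the following is a minimal, purely illustrative Python sketch of how a dual-transformer RVQ pipeline can stream audio frame by frame. The class names, codebook count, and frame rate are placeholders chosen for the example, not Marvis's actual implementation.

# Illustrative sketch only (not the Marvis/CSM code): a backbone picks a coarse
# RVQ token per frame, a small decoder fills in the residual tokens, and a codec
# turns each completed frame into an audio chunk that can be played immediately.
import numpy as np

NUM_CODEBOOKS = 8      # assumption: number of RVQ codebooks in the codec
FRAME_RATE_HZ = 12.5   # assumption: Mimi-style frame rate
SAMPLE_RATE = 24_000   # assumption: output sample rate

class Backbone:
    """Stand-in for the multimodal backbone: text (+ audio history) -> coarse token."""
    def next_coarse_token(self, text_tokens, audio_history):
        return int(np.random.randint(0, 2048))  # dummy coarse/semantic RVQ token

class AudioDecoder:
    """Stand-in for the small audio decoder: coarse token -> residual tokens."""
    def residual_tokens(self, coarse_token):
        return np.random.randint(0, 2048, size=NUM_CODEBOOKS - 1)

class Codec:
    """Stand-in for the Mimi decoder: one RVQ frame -> waveform chunk."""
    def decode(self, rvq_frame):
        samples_per_frame = int(SAMPLE_RATE / FRAME_RATE_HZ)
        return np.zeros(samples_per_frame, dtype=np.float32)  # silence placeholder

def stream_tts(text_tokens, n_frames=50):
    backbone, decoder, codec = Backbone(), AudioDecoder(), Codec()
    audio_history = []
    for _ in range(n_frames):
        coarse = backbone.next_coarse_token(text_tokens, audio_history)
        frame = np.concatenate([[coarse], decoder.residual_tokens(coarse)])
        audio_history.append(frame)          # keeps context for later frames
        yield codec.decode(frame)            # stream each chunk as it is ready

if __name__ == "__main__":
    chunks = list(stream_tts(text_tokens=[1, 2, 3]))
    print(f"streamed {len(chunks)} audio chunks")

Because the whole text sequence is in the backbone's context while frames are emitted one at a time, the model can stream with low latency without splitting the input into independently synthesized chunks.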
Quick Start & Requirements
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream --text "..."
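For programmatic use, the model can in principle be driven through Hugging Face transformers' CSM integration (transformers is listed among the dependencies). The snippet below is a hedged sketch based on transformers' generic CsmForConditionalGeneration usage; whether Marvis-AI/marvis-tts-250m-v0.1 loads through this path exactly as shown is an assumption, so check the project README for the canonical example.

# Hedged sketch: loading Marvis via transformers' CSM support (an assumption,
# not confirmed by this listing). Requires transformers, torch, and soundfile.
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration

model_id = "Marvis-AI/marvis-tts-250m-v0.1"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id).to(device)

# "[0]" prefixes the text with a speaker id, following the CSM convention.
text = "[0]Marvis streams speech in real time."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)

# output_audio=True asks generate() to return decoded waveforms instead of tokens.
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "marvis_sample.wav")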
Dependencies: mlx-audio, transformers, torch, soundfile. Optimized for Apple Silicon for edge deployment.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model is primarily optimized for English, and performance on other languages may be suboptimal. Voice cloning quality depends on the clarity of the 10-second reference audio and degrades with background noise. The model may hallucinate words, particularly on novel or very short inputs. Users must consider the legal and ethical implications of voice synthesis and impersonation.
1 month ago · Inactive