marvis-tts by Marvis-Labs

Real-time conversational speech synthesis and voice cloning

Created 1 month ago
267 stars

Top 95.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Marvis-TTS is a real-time conversational speech model for rapid voice cloning and streaming text-to-speech synthesis. It targets high-quality, efficient speech generation on consumer hardware such as Apple Silicon Macs. Its primary benefit is natural, real-time voice cloning from minimal reference audio, running entirely on-device.

How It Works

Marvis is built on the Sesame CSM-1B multimodal transformer architecture, operating directly on Residual Vector Quantization (RVQ) tokens via Kyutai's mimi codec. It employs a dual-transformer design: a 250M parameter multimodal backbone for semantic understanding and a 60M parameter audio decoder for speech reconstruction. This approach allows end-to-end training, low-latency generation, and contextual processing of entire text sequences, avoiding chunking artifacts for more natural intonation and flow.
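A simplified sketch helps make that token flow concrete. The Python outline below is illustrative only, not the Marvis implementation: the codebook count, frame rate, sample rate, and the stand-in functions are assumptions based on typical Mimi-style RVQ setups, while the backbone/decoder split mirrors the description above.

    import numpy as np

    # Illustrative constants -- assumptions, not Marvis's exact configuration.
    NUM_CODEBOOKS = 8      # RVQ depth: one semantic level plus residual levels
    CODEBOOK_SIZE = 2048   # entries per codebook
    FRAME_RATE_HZ = 12.5   # Mimi-style codecs emit roughly 12.5 frames per second
    SAMPLE_RATE = 24_000   # assumed output sample rate

    rng = np.random.default_rng(0)

    def backbone_step(text_tokens, audio_history):
        """Stand-in for the 250M multimodal backbone: reads the full text
        context plus previously generated frames and predicts the zeroth
        (most semantic) RVQ token of the next audio frame."""
        return int(rng.integers(0, CODEBOOK_SIZE))  # a real model samples from logits

    def audio_decoder_step(zeroth_token):
        """Stand-in for the 60M audio decoder: expands one zeroth token into
        the remaining residual RVQ tokens for the same frame."""
        return [int(rng.integers(0, CODEBOOK_SIZE)) for _ in range(NUM_CODEBOOKS - 1)]

    def mimi_decode(frames):
        """Stand-in for the Mimi codec decoder: RVQ token frames -> waveform."""
        samples_per_frame = int(SAMPLE_RATE / FRAME_RATE_HZ)
        return np.zeros(len(frames) * samples_per_frame, dtype=np.float32)

    def synthesize(text_tokens, num_frames=25):
        frames = []
        for _ in range(num_frames):
            z = backbone_step(text_tokens, frames)   # semantic token for this frame
            residuals = audio_decoder_step(z)        # acoustic detail for this frame
            frames.append([z, *residuals])           # one complete RVQ frame
        return mimi_decode(frames)

    audio = synthesize(text_tokens=[101, 2023, 2003, 1037, 3231])
    print(audio.shape)  # each completed chunk can be streamed or written to disk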

Quick Start & Requirements

  • Installation: pip install -U mlx-audio
  • Execution: python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream --text "..." (a minimal Python wrapper is sketched after this list)
  • Prerequisites: Python, MLX, mlx-audio, transformers, torch, soundfile. Optimized for Apple Silicon for edge deployment.
  • Resources: Quantized model is 500MB. GPU recommended for real-time inference.
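
For scripted use, the documented CLI can be invoked from Python. The sketch below is a minimal wrapper that relies only on the flags shown above; the example text is a placeholder, and any further options would need the mlx-audio documentation.

    import subprocess
    import sys

    # Minimal wrapper around the documented mlx-audio CLI; only the flags
    # shown in the Quick Start above are used.
    def speak(text: str, model: str = "Marvis-AI/marvis-tts-250m-v0.1") -> None:
        subprocess.run(
            [
                sys.executable, "-m", "mlx_audio.tts.generate",
                "--model", model,
                "--stream",           # stream audio chunks as they are generated
                "--text", text,
            ],
            check=True,
        )

    if __name__ == "__main__":
        speak("Hello from Marvis, a real-time conversational TTS model.")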

Highlighted Details

  • Rapid Voice Cloning: Clones voices using only 10 seconds of reference audio.
  • Real-time Streaming: Generates audio chunks as text is processed for conversational flow (see the playback sketch after this list).
  • Compact Size: Quantized model is approximately 500MB, suitable for on-device inference.
  • Edge Deployment: Optimized for real-time Speech-to-Speech (STS) on mobile devices (iOS, Android).
  • Training Cost: Total training cost estimated at ~$2,000.
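
To make the streaming behavior concrete, the sketch below shows one way a caller could consume audio chunks as they arrive and append them to a WAV file with soundfile (listed in the prerequisites). The chunk generator is a hypothetical stand-in and the 24 kHz sample rate is an assumption; the actual streaming entry point is the --stream flag shown in the Quick Start.

    import numpy as np
    import soundfile as sf

    SAMPLE_RATE = 24_000  # assumption: Mimi-based models typically emit 24 kHz audio

    def chunk_stream(num_chunks=5, chunk_seconds=0.5):
        """Hypothetical stand-in for a streaming TTS generator that yields
        short waveform chunks as soon as each piece of text is synthesized."""
        for _ in range(num_chunks):
            yield np.zeros(int(SAMPLE_RATE * chunk_seconds), dtype=np.float32)

    # Write each chunk to disk as it arrives rather than waiting for the full
    # clip -- this incremental handling is what keeps perceived latency low.
    with sf.SoundFile("reply.wav", mode="w", samplerate=SAMPLE_RATE,
                      channels=1, subtype="FLOAT") as out:
        for chunk in chunk_stream():
            out.write(chunk)  # could equally be pushed to an audio output device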

Maintenance & Community

  • Creators: Prince Canuma & Lucas Newman.
  • Version: 0.1 (released August 26, 2025).
  • No community links (Discord, Slack, etc.) are provided in the documentation.

Licensing & Compatibility

  • License: Apache 2.0. This license is permissive and generally allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The model is primarily optimized for English, with potential suboptimal performance on other languages. Voice cloning quality is dependent on the clarity of the 10-second reference audio, and performance degrades with background noise. The model may hallucinate words, particularly for new or short inputs. Users must consider legal and ethical implications regarding voice synthesis and impersonation.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

TTS model for human-like, expressive speech
4k stars · Top 0.6% on SourcePulse
Created 1 year ago · Updated 1 year ago