marvis-tts by Marvis-Labs

Real-time conversational speech synthesis and voice cloning

Created 1 month ago
267 stars

Top 95.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Marvis-TTS is a real-time conversational speech model for rapid voice cloning and streaming text-to-speech synthesis. It targets high-quality, efficient speech generation on consumer hardware such as Apple Silicon Macs. Its primary benefit is natural, real-time voice cloning from minimal reference audio, running entirely on-device.

How It Works

Marvis is built on the Sesame CSM-1B multimodal transformer architecture, operating directly on Residual Vector Quantization (RVQ) tokens via Kyutai's mimi codec. It employs a dual-transformer design: a 250M parameter multimodal backbone for semantic understanding and a 60M parameter audio decoder for speech reconstruction. This approach allows end-to-end training, low-latency generation, and contextual processing of entire text sequences, avoiding chunking artifacts for more natural intonation and flow.
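A simplified sketch helps make that token flow concrete. The Python outline below is illustrative only, not the Marvis implementation: the codebook count, frame rate, sample rate, and the stand-in functions are assumptions based on typical Mimi-style RVQ setups, while the backbone/decoder split mirrors the description above.

    import numpy as np

    # Illustrative constants -- assumptions, not Marvis's exact configuration.
    NUM_CODEBOOKS = 8      # RVQ depth: one semantic level plus residual levels
    CODEBOOK_SIZE = 2048   # entries per codebook
    FRAME_RATE_HZ = 12.5   # Mimi-style codecs emit roughly 12.5 frames per second
    SAMPLE_RATE = 24_000   # assumed output sample rate

    rng = np.random.default_rng(0)

    def backbone_step(text_tokens, audio_history):
        """Stand-in for the 250M multimodal backbone: reads the full text
        context plus previously generated frames and predicts the zeroth
        (most semantic) RVQ token of the next audio frame."""
        return int(rng.integers(0, CODEBOOK_SIZE))  # a real model samples from logits

    def audio_decoder_step(zeroth_token):
        """Stand-in for the 60M audio decoder: expands one zeroth token into
        the remaining residual RVQ tokens for the same frame."""
        return [int(rng.integers(0, CODEBOOK_SIZE)) for _ in range(NUM_CODEBOOKS - 1)]

    def mimi_decode(frames):
        """Stand-in for the Mimi codec decoder: RVQ token frames -> waveform."""
        samples_per_frame = int(SAMPLE_RATE / FRAME_RATE_HZ)
        return np.zeros(len(frames) * samples_per_frame, dtype=np.float32)

    def synthesize(text_tokens, num_frames=25):
        frames = []
        for _ in range(num_frames):
            z = backbone_step(text_tokens, frames)   # semantic token for this frame
            residuals = audio_decoder_step(z)        # acoustic detail for this frame
            frames.append([z, *residuals])           # one complete RVQ frame
        return mimi_decode(frames)

    audio = synthesize(text_tokens=[101, 2023, 2003, 1037, 3231])
    print(audio.shape)  # each completed chunk can be streamed or written to disk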

Quick Start & Requirements

  • Installation: pip install -U mlx-audio
  • Execution: python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream --text "..." (a minimal Python wrapper is sketched after this list)
  • Prerequisites: Python, MLX, mlx-audio, transformers, torch, soundfile. Optimized for Apple Silicon for edge deployment.
  • Resources: Quantized model is 500MB. GPU recommended for real-time inference.
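
For scripted use, the documented CLI can be invoked from Python. The sketch below is a minimal wrapper that relies only on the flags shown above; the example text is a placeholder, and any further options would need the mlx-audio documentation.

    import subprocess
    import sys

    # Minimal wrapper around the documented mlx-audio CLI; only the flags
    # shown in the Quick Start above are used.
    def speak(text: str, model: str = "Marvis-AI/marvis-tts-250m-v0.1") -> None:
        subprocess.run(
            [
                sys.executable, "-m", "mlx_audio.tts.generate",
                "--model", model,
                "--stream",           # stream audio chunks as they are generated
                "--text", text,
            ],
            check=True,
        )

    if __name__ == "__main__":
        speak("Hello from Marvis, a real-time conversational TTS model.")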

Highlighted Details

  • Rapid Voice Cloning: Clones voices using only 10 seconds of reference audio.
  • Real-time Streaming: Generates audio chunks as text is processed for conversational flow (see the playback sketch after this list).
  • Compact Size: Quantized model is approximately 500MB, suitable for on-device inference.
  • Edge Deployment: Optimized for real-time Speech-to-Speech (STS) on mobile devices (iOS, Android).
  • Training Cost: Total training cost estimated at ~$2,000.
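
To make the streaming behavior concrete, the sketch below shows one way a caller could consume audio chunks as they arrive and append them to a WAV file with soundfile (listed in the prerequisites). The chunk generator is a hypothetical stand-in and the 24 kHz sample rate is an assumption; the actual streaming entry point is the --stream flag shown in the Quick Start.

    import numpy as np
    import soundfile as sf

    SAMPLE_RATE = 24_000  # assumption: Mimi-based models typically emit 24 kHz audio

    def chunk_stream(num_chunks=5, chunk_seconds=0.5):
        """Hypothetical stand-in for a streaming TTS generator that yields
        short waveform chunks as soon as each piece of text is synthesized."""
        for _ in range(num_chunks):
            yield np.zeros(int(SAMPLE_RATE * chunk_seconds), dtype=np.float32)

    # Write each chunk to disk as it arrives rather than waiting for the full
    # clip -- this incremental handling is what keeps perceived latency low.
    with sf.SoundFile("reply.wav", mode="w", samplerate=SAMPLE_RATE,
                      channels=1, subtype="FLOAT") as out:
        for chunk in chunk_stream():
            out.write(chunk)  # could equally be pushed to an audio output device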

Maintenance & Community

  • Creators: Prince Canuma & Lucas Newman.
  • Version: 0.1 (released August 26, 2025).
  • No community links (Discord, Slack, etc.) are provided in the documentation.

Licensing & Compatibility

  • License: Apache 2.0. This license is permissive and generally allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The model is primarily optimized for English, with potential suboptimal performance on other languages. Voice cloning quality is dependent on the clarity of the 10-second reference audio, and performance degrades with background noise. The model may hallucinate words, particularly for new or short inputs. Users must consider legal and ethical implications regarding voice synthesis and impersonation.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 24 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Pietro Schirano (founder of MagicPath), and 2 more.

metavoice-src by metavoiceio

TTS model for human-like, expressive speech
4k stars · Top 0.6% on SourcePulse
Created 1 year ago · Updated 1 year ago