Speech-text foundation model for real-time dialogue
Moshi is a real-time, full-duplex spoken dialogue system and foundation model designed for natural, low-latency conversation. It targets researchers and developers building interactive voice applications, combining low-latency speech-to-speech dialogue with high audio quality through its novel neural audio codec, Mimi.
How It Works
Moshi is built on Mimi, a streaming neural audio codec that compresses 24 kHz audio down to a 1.1 kbps representation at a 12.5 Hz frame rate, so each frame covers only 80 ms of audio. Mimi reaches this combination of low bitrate and low latency through encoder/decoder Transformers, an adapted convolutional stride, and distillation of semantic information from WavLM, while adversarial training keeps subjective quality high. On top of the codec, Moshi jointly models two audio streams, one for the user and one for the model, and also predicts the text tokens of its own speech; this "inner monologue" substantially improves generation quality. Generation is handled by a small Depth Transformer that models inter-codebook dependencies and a 7B-parameter Temporal Transformer that models dependencies over time.
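To make the codec concrete, here is a minimal sketch of round-tripping audio through Mimi with the PyTorch package, based on the loaders API the project documents; exact names and signatures may vary between versions.

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Fetch and load the Mimi codec weights from the Hugging Face Hub.
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Mimi exposes up to 32 codebooks; Moshi uses 8.

# Offline round-trip: 10 s of 24 kHz audio -> discrete codes -> audio.
wav = torch.randn(1, 1, int(mimi.sample_rate * 10))  # [batch, channels=1, samples]
with torch.no_grad():
    codes = mimi.encode(wav)       # [batch, 8, frames], 12.5 frames per second
    reconstructed = mimi.decode(codes)

# Streaming: feed one 80 ms frame (24000 / 12.5 = 1920 samples) at a time.
frame_size = int(mimi.sample_rate / mimi.frame_rate)
with torch.no_grad(), mimi.streaming(batch_size=1):
    for offset in range(0, wav.shape[-1], frame_size):
        frame_codes = mimi.encode(wav[:, :, offset : offset + frame_size])
```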
Quick Start & Requirements
- `pip install -U moshi` (PyTorch implementation)
- `pip install -U moshi_mlx` (MLX implementation, for macOS on Apple silicon)
- `pip install rustymimi` (Rust bindings for Mimi)

At least Python 3.10 is required. A minimal end-to-end sketch using the PyTorch package is shown below.
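The following sketch runs the full pipeline, stepping the language model once per 80 ms frame; it is adapted from the package's loaders/LMGen API, so treat names and sampling parameters as indicative rather than exact.

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

device = "cuda"  # a GPU is assumed here; the 7B model is heavy on CPU

# Load the Mimi codec and the Moshi language model.
mimi = loaders.get_mimi(hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME), device=device)
mimi.set_num_codebooks(8)
moshi_lm = loaders.get_moshi_lm(hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME), device=device)
lm_gen = LMGen(moshi_lm, temp=0.8, temp_text=0.7)  # sampling temperatures for audio and text

frame_size = int(mimi.sample_rate / mimi.frame_rate)        # 1920 samples = 80 ms
user_wav = torch.randn(1, 1, frame_size * 125, device=device)  # stand-in for 10 s of mic input

out_chunks = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for offset in range(0, user_wav.shape[-1], frame_size):
        codes = mimi.encode(user_wav[:, :, offset : offset + frame_size])
        tokens = lm_gen.step(codes)  # one model step per frame; None while the model warms up
        if tokens is not None:
            # Stream 0 is the text token ("inner monologue"); streams 1: are audio codebooks.
            out_chunks.append(mimi.decode(tokens[:, 1:]))
out_wav = torch.cat(out_chunks, dim=-1)  # Moshi's reply at 24 kHz
```

For interactive use, the PyTorch package also ships a local server with a web UI (python -m moshi.server).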
Highlighted Details
- Mimi codec: 24 kHz audio at 1.1 kbps, 12.5 Hz frame rate (80 ms frames)
- 7B-parameter Temporal Transformer paired with a lightweight Depth Transformer
- PyTorch, MLX, and Rust implementations
Maintenance & Community
The project is maintained by kyutai-labs. Fine-tuning is supported through a separate repository, moshi-finetune.
Licensing & Compatibility
The Python code is released under the MIT license and the Rust backend under Apache 2.0; the model weights are released under CC-BY 4.0.
Limitations & Caveats
The PyTorch version does not currently support quantization. Official support for Windows is not provided. The web UI client may experience latency issues when tunneled through external services.