moshi by kyutai-labs

Speech-text foundation model for real-time dialogue

created 1 year ago
8,703 stars

Top 5.9% on sourcepulse

Project Summary

Moshi is a real-time, full-duplex spoken dialogue system and foundation model designed for natural, low-latency conversations. It targets researchers and developers building interactive voice applications, delivering low latency and high audio quality through its novel streaming neural audio codec, Mimi.

How It Works

Moshi is built on Mimi, a streaming neural audio codec that compresses 24 kHz audio to a 1.1 kbps representation at 12.5 Hz, giving a frame latency of 80 ms. Mimi reaches this low bitrate and latency by adding Transformers to both its encoder and decoder, adapting convolution strides to the 12.5 Hz frame rate, and distilling semantic information from WavLM, while adversarial training keeps subjective quality high. On top of the codec, Moshi models two audio streams, the user's and its own, and predicts the text tokens corresponding to its own speech as an "inner monologue", which improves generation quality. Modeling is split between a small Depth Transformer, which handles dependencies across codebooks within a time step, and a 7B-parameter Temporal Transformer, which handles dependencies across time steps.
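
As a sanity check on those numbers, the arithmetic below works out the frame size and bitrate. The 8-codebook, 2048-entry quantizer configuration is an assumption (it is what recovers the advertised 1.1 kbps), not something stated above.

```python
# Back-of-the-envelope check of Mimi's advertised numbers.
# Assumption (not stated above): 8 residual codebooks per frame,
# each with 2048 entries (11 bits).
import math

sample_rate_hz = 24_000   # input audio sample rate
frame_rate_hz = 12.5      # Mimi frame rate
num_codebooks = 8         # assumed codebooks used per frame
codebook_size = 2048      # assumed entries per codebook

samples_per_frame = sample_rate_hz / frame_rate_hz         # 1920 samples
frame_ms = 1000 / frame_rate_hz                            # 80 ms frame latency
bits_per_frame = num_codebooks * math.log2(codebook_size)  # 88 bits
bitrate_kbps = frame_rate_hz * bits_per_frame / 1000       # 1.1 kbps

print(f"{samples_per_frame:.0f} samples/frame, {frame_ms:.0f} ms/frame, "
      f"{bitrate_kbps:.2f} kbps")
```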

Quick Start & Requirements

  • Install: pip install -U moshi (PyTorch), pip install -U moshi_mlx (MLX for macOS), pip install rustymimi (Rust bindings).
  • Prerequisites: Python 3.10+ (3.12 recommended). PyTorch version requires a GPU with 24GB VRAM. MLX version is optimized for Apple Silicon (M-series Macs). Rust backend requires the Rust toolchain and optionally CUDA for GPU support.
  • Setup: The Python backends need only the pip installs above; a minimal usage sketch follows this list. Compiling the Rust backend from source takes longer.
  • Docs: Moshi README, Moshi Finetune, Demo, Hugging Face Models
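
As a rough illustration of the Python setup, the sketch below loads the Mimi codec with the PyTorch package and round-trips ten seconds of audio. It follows the usage pattern described in the upstream Moshi README, but the module and function names (moshi.models.loaders, get_mimi, encode/decode) and the default Hugging Face repo lookup should be treated as assumptions and verified against the current README.

```python
# Rough sketch: load Mimi with the PyTorch backend and round-trip audio.
# The loader/function names follow the upstream README's usage example and
# are assumptions here; verify against the current moshi package.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi itself consumes 8 of Mimi's codebooks

wav = torch.randn(1, 1, 24_000 * 10)  # [batch, channels=1, samples] at 24 kHz
with torch.no_grad():
    codes = mimi.encode(wav)            # [batch, codebooks=8, frames] at 12.5 Hz
    reconstructed = mimi.decode(codes)  # back to [batch, 1, samples]
```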

Highlighted Details

  • Full-duplex spoken dialogue framework with a theoretical latency of 160 ms (see the latency sketch after this list).
  • Mimi codec achieves 1.1 kbps at 12.5 Hz, outperforming non-streaming codecs.
  • Multiple backends available: PyTorch (GPU), MLX (macOS), and Rust (production-ready, with CUDA/Metal support).
  • Offers pre-trained models for Mimi (codec), Moshiko (male voice), and Moshika (female voice) across different backends and quantizations.
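
A small sketch of where the 160 ms figure comes from, assuming (per the upstream description, not stated above) that it splits into one 80 ms Mimi frame plus an 80 ms acoustic delay between semantic and acoustic tokens:

```python
# Latency budget for full-duplex dialogue (the 80 ms + 80 ms split is an
# assumption taken from the upstream description, not stated above).
frame_rate_hz = 12.5
frame_ms = 1000 / frame_rate_hz                          # 80 ms: one Mimi frame
acoustic_delay_ms = 80                                   # semantic-to-acoustic token delay
theoretical_latency_ms = frame_ms + acoustic_delay_ms    # 160 ms

# To keep up in real time, the 7B Temporal Transformer (plus the Depth
# Transformer's per-codebook passes) must complete one step every 80 ms,
# i.e. at least 12.5 steps per second.
min_steps_per_second = 1000 / frame_ms
print(theoretical_latency_ms, min_steps_per_second)
```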

Maintenance & Community

The project is maintained by kyutai-labs. Fine-tuning is supported through a separate repository (linked above as Moshi Finetune).

Licensing & Compatibility

  • Code: MIT license (Python, Web Client), Apache license (Rust backend).
  • Model Weights: CC-BY 4.0 license.
  • Compatibility: Both code licenses (MIT and Apache) permit commercial use; the CC-BY 4.0 model weights require attribution.

Limitations & Caveats

The PyTorch version does not currently support quantization. Official support for Windows is not provided. The web UI client may experience latency issues when tunneled through external services.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 24
  • Issues (30d): 5

Star History

596 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).

JittorLLMs by Jittor

2k stars
Low-resource LLM inference library
created 2 years ago
updated 5 months ago