Speech-text foundation model for real-time dialogue
Moshi is a real-time, full-duplex spoken dialogue system and foundation model designed for natural, low-latency conversation. It targets researchers and developers building interactive voice applications, combining low-latency speech-to-speech dialogue with high audio quality through its novel neural audio codec, Mimi.
How It Works
Moshi is built on Mimi, a streaming neural audio codec that compresses 24 kHz audio down to a 1.1 kbps representation at a 12.5 Hz frame rate, so each frame covers only 80 ms of audio. Mimi reaches this combination of low bitrate and low latency through encoder/decoder Transformers, an adapted convolutional stride, and distillation of semantic information from WavLM, while adversarial training keeps subjective quality high. On top of the codec, Moshi jointly models two audio streams, one for the user and one for the model, and also predicts the text tokens of its own speech; this "inner monologue" substantially improves generation quality. Generation is handled by a small Depth Transformer that models inter-codebook dependencies and a 7B-parameter Temporal Transformer that models dependencies over time.
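To make the codec concrete, here is a minimal sketch of round-tripping audio through Mimi with the PyTorch package, based on the loaders API the project documents; exact names and signatures may vary between versions.

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Fetch and load the Mimi codec weights from the Hugging Face Hub.
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Mimi exposes up to 32 codebooks; Moshi uses 8.

# Offline round-trip: 10 s of 24 kHz audio -> discrete codes -> audio.
wav = torch.randn(1, 1, int(mimi.sample_rate * 10))  # [batch, channels=1, samples]
with torch.no_grad():
    codes = mimi.encode(wav)       # [batch, 8, frames], 12.5 frames per second
    reconstructed = mimi.decode(codes)

# Streaming: feed one 80 ms frame (24000 / 12.5 = 1920 samples) at a time.
frame_size = int(mimi.sample_rate / mimi.frame_rate)
with torch.no_grad(), mimi.streaming(batch_size=1):
    for offset in range(0, wav.shape[-1], frame_size):
        frame_codes = mimi.encode(wav[:, :, offset : offset + frame_size])
```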
Quick Start & Requirements
- `pip install -U moshi` (PyTorch implementation)
- `pip install -U moshi_mlx` (MLX implementation, for macOS on Apple silicon)
- `pip install rustymimi` (Rust bindings for Mimi)

At least Python 3.10 is required. A minimal end-to-end sketch using the PyTorch package is shown below.
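The following sketch runs the full pipeline, stepping the language model once per 80 ms frame; it is adapted from the package's loaders/LMGen API, so treat names and sampling parameters as indicative rather than exact.

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders, LMGen

device = "cuda"  # a GPU is assumed here; the 7B model is heavy on CPU

# Load the Mimi codec and the Moshi language model.
mimi = loaders.get_mimi(hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME), device=device)
mimi.set_num_codebooks(8)
moshi_lm = loaders.get_moshi_lm(hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME), device=device)
lm_gen = LMGen(moshi_lm, temp=0.8, temp_text=0.7)  # sampling temperatures for audio and text

frame_size = int(mimi.sample_rate / mimi.frame_rate)        # 1920 samples = 80 ms
user_wav = torch.randn(1, 1, frame_size * 125, device=device)  # stand-in for 10 s of mic input

out_chunks = []
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for offset in range(0, user_wav.shape[-1], frame_size):
        codes = mimi.encode(user_wav[:, :, offset : offset + frame_size])
        tokens = lm_gen.step(codes)  # one model step per frame; None while the model warms up
        if tokens is not None:
            # Stream 0 is the text token ("inner monologue"); streams 1: are audio codebooks.
            out_chunks.append(mimi.decode(tokens[:, 1:]))
out_wav = torch.cat(out_chunks, dim=-1)  # Moshi's reply at 24 kHz
```

For interactive use, the PyTorch package also ships a local server with a web UI (python -m moshi.server).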
Highlighted Details
- Mimi codec: 24 kHz audio at 1.1 kbps, 12.5 Hz frame rate (80 ms frames)
- 7B-parameter Temporal Transformer paired with a lightweight Depth Transformer
- PyTorch, MLX, and Rust implementations
Maintenance & Community
The project is maintained by kyutai-labs. Fine-tuning is supported through a separate repository, moshi-finetune.
Licensing & Compatibility
The Python code is released under the MIT license and the Rust backend under Apache 2.0; the model weights are released under CC-BY 4.0.
Limitations & Caveats
The PyTorch version does not currently support quantization. Official support for Windows is not provided. The web UI client may experience latency issues when tunneled through external services.