Streaming multimodal sequence-to-sequence learning
This repository provides implementations of Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning, applied here to Speech-To-Text (STT) and Text-To-Speech (TTS). It targets researchers, developers building real-time voice applications, and anyone who needs on-device inference on Apple silicon, with an emphasis on low latency and efficient inference.
How It Works
DSM formalizes a novel approach to streaming X-to-Y tasks, enabling models to process data in chunks for real-time applications. The STT models offer word-level timestamps and include a semantic Voice Activity Detection (VAD) component for voice agents. The TTS models support streaming output and can be quantized for faster inference on resource-constrained devices.
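As a rough, hypothetical illustration of the delayed-streams idea (not the repository's actual implementation), the sketch below aligns a sparse text stream to an audio-frame stream by right-shifting the text by a fixed delay, so text for a given frame is only emitted after that frame has been observed:

```python
# Illustrative sketch of delayed-streams alignment (hypothetical, not this
# repo's code): the text stream is shifted right by a fixed delay so each
# text token is predicted only after the corresponding audio frames arrive.
PAD = "<pad>"

def delay_stream(tokens, delay, total_len):
    """Right-shift a token stream by `delay` steps, padding to total_len."""
    shifted = [PAD] * delay + list(tokens)
    shifted += [PAD] * max(0, total_len - len(shifted))
    return shifted[:total_len]

audio_frames = [f"a{t}" for t in range(8)]   # one audio frame per step
text_tokens = ["hello", "world"]             # sparse text stream
aligned_text = delay_stream(text_tokens, delay=2, total_len=len(audio_frames))

for t, (frame, word) in enumerate(zip(audio_frames, aligned_text)):
    print(t, frame, word)  # at each step the model sees (audio, delayed text)
```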
Quick Start & Requirements
- PyTorch (STT): install with `pip install moshi` (or run via `uvx --with moshi`), then run inference with `python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3`.
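If you prefer to drive that CLI from a script, a minimal wrapper might look like the sketch below; `transcribe` is a hypothetical helper, and the flags are the ones from the command above.

```python
import subprocess

# Hypothetical convenience wrapper around the documented CLI entry point;
# the flags match the quick-start command shown above.
def transcribe(path: str, repo: str = "kyutai/stt-2.6b-en") -> str:
    result = subprocess.run(
        ["python", "-m", "moshi.run_inference", "--hf-repo", repo, path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(transcribe("audio/bria.mp3"))
```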
- Rust server (STT): install with `cargo install --features cuda moshi-server`, then start a worker with `moshi-server worker --config configs/config-stt-en_fr-hf.toml`. Requires CUDA.
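Once the worker is up, clients talk to it over a websocket. The sketch below is a rough illustration only; the endpoint path, port, and raw-bytes framing are assumptions here, so consult the repository's client scripts for the actual protocol.

```python
import asyncio
import websockets  # pip install websockets

# Rough client sketch; the URL and wire format are assumptions, not the
# server's documented protocol (it uses a structured message format).
SERVER_URL = "ws://127.0.0.1:8080/api/asr-streaming"  # assumed route/port

async def stream_audio(pcm_chunks):
    async with websockets.connect(SERVER_URL) as ws:
        for chunk in pcm_chunks:      # raw PCM bytes, sent as they arrive
            await ws.send(chunk)
        async for message in ws:      # server pushes words as they decode
            print(message)

# asyncio.run(stream_audio(chunks_from_microphone()))  # hypothetical source
```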
- MLX (STT): install with `pip install moshi-mlx` (or run via `uvx --with moshi-mlx`). Requires Apple silicon.
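Assuming the MLX package mirrors the PyTorch entry point (the module name `moshi_mlx.run_inference` and the `-mlx` model-repo suffix are assumptions here; check the repository's MLX scripts for the exact invocation), inference on Apple silicon would look like `python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3`.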
- PyTorch (TTS): install with `pip install moshi` (or run via `uvx --with moshi`).
- Rust server (TTS): install the `moshi-server` crate via `cargo install --features cuda moshi-server`; launching the TTS worker may also involve the repository's `start_tts.sh` script. Requires CUDA.
Highlighted Details
- STT: word-level timestamps and a semantic Voice Activity Detection (VAD) component for voice agents.
- TTS: streaming output, with optional quantization for faster inference on resource-constrained devices.
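As a rough illustration of the streaming-output point above, the generic sketch below writes audio to disk chunk-by-chunk as it arrives, so saving (or playback) can begin before synthesis finishes. It is not tied to this repository's API, and the mono 16-bit, 24 kHz PCM format is an assumption:

```python
import wave

# Generic streaming-audio consumer (illustrative only): frames are written
# incrementally, so output is usable before synthesis completes.
def save_streaming(chunks, path="out.wav", sample_rate=24000):
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # assumed mono
        wav.setsampwidth(2)        # assumed 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:       # each chunk: raw PCM bytes from the stream
            wav.writeframes(chunk)
```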
Maintenance & Community
The underlying models and inference code are maintained in the companion moshi repository.
Licensing & Compatibility
Limitations & Caveats
The Rust server backends require CUDA, and the MLX backend runs only on Apple silicon.