kyutai-labs: Streaming multimodal sequence-to-sequence learning
This repository provides implementations for Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning, specifically for Speech-To-Text (STT) and Text-To-Speech (TTS). It targets researchers, developers building real-time voice applications, and those needing on-device AI for Apple silicon, offering significant advantages in latency and efficiency.
How It Works
DSM formalizes a novel approach to streaming X-to-Y tasks, enabling models to process data in chunks for real-time applications. The STT models offer word-level timestamps and include a semantic Voice Activity Detection (VAD) component for voice agents. The TTS models support streaming output and can be quantized for faster inference on resource-constrained devices.
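As an illustrative sketch only (not the kyutai implementation, and the function name is hypothetical), the delayed-streams idea can be pictured as padding one stream so it trails the other by a fixed delay, letting a model emit output tokens while input tokens are still arriving:

```python
def interleave_delayed(audio_tokens, text_tokens, delay, pad=0):
    """Align a text stream `delay` steps behind an audio stream.

    At each step the model sees one (audio, text) pair; because the text
    stream is shifted right by `delay`, text for step t is emitted only
    after audio frames t..t+delay have been observed (streaming STT).
    """
    n = len(audio_tokens) + delay
    audio = list(audio_tokens) + [pad] * delay          # audio runs out first
    text = [pad] * delay + list(text_tokens)            # text starts late
    text += [pad] * max(0, n - len(text))               # pad to equal length
    return list(zip(audio, text[:n]))

# With a delay of 2, text token 10 is aligned with audio frame 3:
frames = interleave_delayed([1, 2, 3], [10, 20, 30], delay=2)
```

The same alignment, mirrored (audio trailing text), yields a streaming TTS formulation.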
Quick Start & Requirements
- PyTorch (STT): `pip install moshi` or `uvx --with moshi`. Run inference with `python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3`.
- Rust server (STT): `cargo install --features cuda moshi-server`, then start the server with `moshi-server worker --config configs/config-stt-en_fr-hf.toml`. Requires CUDA.
- MLX (STT): `pip install moshi-mlx` or `uvx --with moshi-mlx`. Requires Apple silicon.
- PyTorch (TTS): `pip install moshi` or `uvx --with moshi`.
- Rust server (TTS): install the `moshi-server` crate via `cargo install --features cuda moshi-server`, and potentially a `start_tts.sh` script. Requires CUDA.

Highlighted Details
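The PyTorch STT command above can also be scripted. A minimal wrapper, assuming the `moshi` package is installed in the current environment and that the CLI prints the transcript to stdout (both assumptions; `build_stt_command` and `transcribe` are hypothetical helper names):

```python
import subprocess
import sys

def build_stt_command(audio_path, hf_repo="kyutai/stt-2.6b-en"):
    # Assembles the STT inference command shown in the quick start;
    # assumes `moshi` is installed in the interpreter's environment.
    return [sys.executable, "-m", "moshi.run_inference",
            "--hf-repo", hf_repo, audio_path]

def transcribe(audio_path):
    # Runs the documented CLI and returns whatever it writes to stdout
    # (assumed here to be the transcript).
    result = subprocess.run(build_stt_command(audio_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For batch work, `transcribe` can be called in a loop over audio files; the Rust `moshi-server` path is the better fit for low-latency, long-running services.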
Maintenance & Community
See the moshi repository for related development and community activity.

Licensing & Compatibility
Limitations & Caveats