delayed-streams-modeling by kyutai-labs

Streaming multimodal sequence-to-sequence learning

Created 1 month ago · 2,116 stars · Top 21.7% on sourcepulse

Project Summary

This repository provides implementations of Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning, applied here to Speech-To-Text (STT) and Text-To-Speech (TTS). It targets researchers, developers building real-time voice applications, and users who need on-device AI on Apple silicon, with an emphasis on low latency and efficient inference.

How It Works

DSM formalizes a novel approach to streaming X-to-Y tasks, enabling models to process data in chunks for real-time applications. The STT models offer word-level timestamps and include a semantic Voice Activity Detection (VAD) component for voice agents. The TTS models support streaming output and can be quantized for faster inference on resource-constrained devices.
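
The section above only names the idea, so here is a minimal, illustrative Python sketch of what "delayed streams" means in practice: two time-aligned token streams are modeled jointly, with one stream shifted by a fixed delay relative to the other so the model can condition on the other stream's recent past. All names (PAD, delay_stream, joint_frames) are hypothetical and not part of the repository's API.

    # Illustrative sketch of the delayed-streams idea (hypothetical names,
    # not the repository's actual API). For STT the text stream is delayed
    # behind the audio stream; for TTS it is the other way around.

    PAD = -1  # hypothetical padding token used to realize the shift


    def delay_stream(tokens: list[int], delay: int, pad: int = PAD) -> list[int]:
        """Shift a token stream right by `delay` steps, padding the start."""
        if delay <= 0:
            return list(tokens)
        return [pad] * delay + tokens[: len(tokens) - delay]


    def joint_frames(audio: list[int], text: list[int], delay: int) -> list[tuple[int, int]]:
        """Pair each audio frame with the text token `delay` frames behind it,
        forming the joint sequence a streaming model consumes step by step."""
        return list(zip(audio, delay_stream(text, delay)))


    if __name__ == "__main__":
        audio = [101, 102, 103, 104, 105]  # toy audio-token stream
        text = [7, 8, 9, 10, 11]           # toy time-aligned text-token stream
        # With a 2-frame delay, the model sees audio up to frame t before it
        # must emit the text aligned with frame t - 2, which keeps latency low.
        print(joint_frames(audio, text, delay=2))
        # -> [(101, -1), (102, -1), (103, 7), (104, 8), (105, 9)]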

Quick Start & Requirements

  • PyTorch STT: pip install moshi (or run via uvx --with moshi). Example: python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3 (see the wrapper sketch after this list).
  • Rust STT Server: cargo install --features cuda moshi-server. Start server with moshi-server worker --config configs/config-stt-en_fr-hf.toml. Requires CUDA.
  • MLX STT/TTS: pip install moshi-mlx or uvx --with moshi-mlx. Requires Apple silicon.
  • PyTorch TTS: pip install moshi or uvx --with moshi.
  • Rust TTS Server: Requires moshi-server crate installation via cargo install --features cuda moshi-server and potentially a start_tts.sh script. Requires CUDA.
  • Dependencies: Python 3.x, Rust, Cargo, CUDA (for Rust server), Apple silicon (for MLX).
  • Resources: STT models range from about 1B to 2.6B parameters. An H100 can process 400 streams; an L40S serves 64 connections at a real-time factor (RTF) of 3x.
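
For convenience, the documented PyTorch STT command can be wrapped in a small helper. Only the python -m moshi.run_inference invocation and the kyutai/stt-2.6b-en repo name come from the quick-start above; the transcribe helper itself is an illustrative sketch, not part of the moshi package.

    # Minimal convenience wrapper (illustrative, not part of the moshi package).
    import subprocess
    import sys


    def transcribe(audio_path: str, hf_repo: str = "kyutai/stt-2.6b-en") -> str:
        """Run the documented PyTorch STT CLI on a local audio file, return stdout."""
        result = subprocess.run(
            [sys.executable, "-m", "moshi.run_inference", "--hf-repo", hf_repo, audio_path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


    if __name__ == "__main__":
        print(transcribe("audio/bria.mp3"))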

Highlighted Details

  • Streaming inference with low latency (e.g., 0.5s delay for 1B STT model).
  • Word-level timestamps for STT (a caption-building sketch follows this list).
  • Semantic VAD for STT models.
  • Multiple implementations: PyTorch (research), Rust (production server), MLX (Apple silicon).
  • Prompting capabilities for STT (e.g., influencing spelling, speaker adaptation).
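
The exact timestamp output format is not described here, so the following sketch is hypothetical: it only illustrates how word-level timestamps from the STT models could be grouped into simple captions. The Word structure and sample data are made up for the example.

    # Hypothetical consumer of word-level timestamps; the real output format is
    # defined by the repository's STT tooling, not by this sketch.
    from dataclasses import dataclass


    @dataclass
    class Word:
        text: str
        start: float  # seconds from the start of the stream
        end: float


    def to_captions(words: list[Word], window: float = 3.0) -> list[tuple[float, float, str]]:
        """Group timestamped words into caption segments of at most `window` seconds."""
        captions: list[tuple[float, float, str]] = []
        current: list[Word] = []
        for w in words:
            if current and w.end - current[0].start > window:
                captions.append((current[0].start, current[-1].end,
                                 " ".join(x.text for x in current)))
                current = []
            current.append(w)
        if current:
            captions.append((current[0].start, current[-1].end,
                             " ".join(x.text for x in current)))
        return captions


    if __name__ == "__main__":
        demo = [Word("hello", 0.2, 0.5), Word("world", 0.6, 0.9), Word("again", 3.4, 3.8)]
        print(to_captions(demo))
        # -> [(0.2, 0.9, 'hello world'), (3.4, 3.8, 'again')]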

Maintenance & Community

  • A project page and a pre-print paper are available for further details.
  • Colab notebooks are linked for the PyTorch implementations.
  • The Rust server implementation lives in the separate moshi repository.

Licensing & Compatibility

  • Python code: MIT License.
  • Rust backend: Apache License.
  • Web client code: MIT License.
  • Model weights: CC-BY 4.0 License.
  • Commercial use is permitted; the CC-BY 4.0 model weights require attribution.

Limitations & Caveats

  • The prompt feature for STT is experimental and sensitive to input.
  • Installing the Rust TTS server can be complex; the authors suggest opening an issue if the instructions break.
  • CUDA is required for the Rust server, but specific supported CUDA versions are not stated.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 17
  • Issues (30d): 37
  • Star history: 2,206 stars in the last 90 days
