delayed-streams-modeling by kyutai-labs

Streaming multimodal sequence-to-sequence learning

Created 3 months ago
2,371 stars

Top 19.4% on SourcePulse

View on GitHub
Project Summary

This repository provides implementations of Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning, applied here to Speech-To-Text (STT) and Text-To-Speech (TTS). It targets researchers, developers building real-time voice applications, and those needing on-device AI on Apple silicon, with a focus on low latency and efficient inference.

How It Works

DSM formalizes a novel approach to streaming X-to-Y tasks, enabling models to process data in chunks for real-time applications. The STT models offer word-level timestamps and include a semantic Voice Activity Detection (VAD) component for voice agents. The TTS models support streaming output and can be quantized for faster inference on resource-constrained devices.
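The core idea can be pictured with a toy sketch: two time-aligned token streams (e.g. audio in, text out) are modeled jointly on one timeline, with the output stream shifted right by a fixed delay so each step can be emitted as soon as enough input has arrived. This is pure illustrative Python, not the moshi API; `PAD` and `delay_stream` are hypothetical names.

```python
# Toy illustration of the delayed-streams idea (not the kyutai/moshi API).
PAD = "_"  # hypothetical padding token filling the delay gap

def delay_stream(stream, delay, length):
    """Shift a stream right by `delay` steps, padding to `length`."""
    shifted = [PAD] * delay + list(stream)
    return (shifted + [PAD] * length)[:length]

audio = ["a0", "a1", "a2", "a3", "a4"]   # input stream (e.g. audio frames)
text = ["t0", "t1", "t2"]                # output stream (e.g. text tokens)

delay = 2  # output lags input by 2 steps (cf. the 0.5 s delay of the 1B STT model)
frames = list(zip(audio, delay_stream(text, delay, len(audio))))
# Each frame pairs the current input step with the delayed output step:
# [('a0', '_'), ('a1', '_'), ('a2', 't0'), ('a3', 't1'), ('a4', 't2')]
print(frames)
```

Because every timestep carries both streams, the same joint model can stream either direction: delay the text stream for STT, or delay the audio stream for TTS.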

Quick Start & Requirements

  • PyTorch STT: pip install moshi or uvx --with moshi. Run inference with python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3.
  • Rust STT Server: cargo install --features cuda moshi-server. Start server with moshi-server worker --config configs/config-stt-en_fr-hf.toml. Requires CUDA.
  • MLX STT/TTS: pip install moshi-mlx or uvx --with moshi-mlx. Requires Apple silicon.
  • PyTorch TTS: pip install moshi or uvx --with moshi.
  • Rust TTS Server: Requires the moshi-server crate, installed via cargo install --features cuda moshi-server; the provided start_tts.sh script may also be needed. Requires CUDA.
  • Dependencies: Python 3.x, Rust, Cargo, CUDA (for Rust server), Apple silicon (for MLX).
  • Resources: STT models range from ~1B to 2.6B parameters. An H100 GPU can process 400 concurrent streams; an L40S serves 64 connections at 3x real-time factor (RTF).
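Put together, the PyTorch STT quick start from the bullets above looks like the following (model weights are fetched from the Hugging Face Hub on first run; the Rust server commands require a CUDA-capable GPU):

```shell
# PyTorch STT: install and transcribe a sample file
pip install moshi
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3

# Rust STT server: install the crate and start a worker (requires CUDA)
cargo install --features cuda moshi-server
moshi-server worker --config configs/config-stt-en_fr-hf.toml
```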

Highlighted Details

  • Streaming inference with low latency (e.g., 0.5s delay for 1B STT model).
  • Word-level timestamps for STT.
  • Semantic VAD for STT models.
  • Multiple implementations: PyTorch (research), Rust (production server), MLX (Apple silicon).
  • Prompting capabilities for STT (e.g., influencing spelling, speaker adaptation).

Maintenance & Community

  • A project page and a pre-print paper are available for more details.
  • Links to Colab notebooks for PyTorch implementations.
  • Rust server code is in a separate moshi repository.

Licensing & Compatibility

  • Python code: MIT License.
  • Rust backend: Apache License.
  • Web client code: MIT License.
  • Model weights: CC-BY 4.0 License.
  • Compatible with commercial use, but model weights have attribution requirements.

Limitations & Caveats

  • The prompt feature for STT is experimental and sensitive to input.
  • Rust TTS server installation can be complex; the maintainers suggest opening an issue if it breaks.
  • Required CUDA versions are not explicitly stated, though CUDA is needed for the Rust server.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 6
  • Issues (30d): 15
  • Star history: 134 stars in the last 30 days
