delayed-streams-modeling by kyutai-labs

Streaming multimodal sequence-to-sequence learning

Created 3 months ago
2,371 stars

Top 19.4% on SourcePulse

View on GitHub
Project Summary

This repository provides implementations of Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning, applied here to Speech-To-Text (STT) and Text-To-Speech (TTS). It targets researchers, developers building real-time voice applications, and those needing on-device AI on Apple silicon, with a focus on low latency and efficient inference.

How It Works

DSM formalizes a novel approach to streaming X-to-Y tasks, enabling models to process data in chunks for real-time applications. The STT models offer word-level timestamps and include a semantic Voice Activity Detection (VAD) component for voice agents. The TTS models support streaming output and can be quantized for faster inference on resource-constrained devices.
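The core idea can be pictured with a toy sketch: two time-aligned token streams (e.g. audio in, text out) are modeled jointly on one timeline, with the output stream shifted right by a fixed delay so each step can be emitted as soon as enough input has arrived. This is pure illustrative Python, not the moshi API; `PAD` and `delay_stream` are hypothetical names.

```python
# Toy illustration of the delayed-streams idea (not the kyutai/moshi API).
PAD = "_"  # hypothetical padding token filling the delay gap

def delay_stream(stream, delay, length):
    """Shift a stream right by `delay` steps, padding to `length`."""
    shifted = [PAD] * delay + list(stream)
    return (shifted + [PAD] * length)[:length]

audio = ["a0", "a1", "a2", "a3", "a4"]   # input stream (e.g. audio frames)
text = ["t0", "t1", "t2"]                # output stream (e.g. text tokens)

delay = 2  # output lags input by 2 steps (cf. the 0.5 s delay of the 1B STT model)
frames = list(zip(audio, delay_stream(text, delay, len(audio))))
# Each frame pairs the current input step with the delayed output step:
# [('a0', '_'), ('a1', '_'), ('a2', 't0'), ('a3', 't1'), ('a4', 't2')]
print(frames)
```

Because every timestep carries both streams, the same joint model can stream either direction: delay the text stream for STT, or delay the audio stream for TTS.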

Quick Start & Requirements

  • PyTorch STT: pip install moshi or uvx --with moshi. Run inference with python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3.
  • Rust STT Server: cargo install --features cuda moshi-server. Start server with moshi-server worker --config configs/config-stt-en_fr-hf.toml. Requires CUDA.
  • MLX STT/TTS: pip install moshi-mlx or uvx --with moshi-mlx. Requires Apple silicon.
  • PyTorch TTS: pip install moshi or uvx --with moshi.
  • Rust TTS Server: Requires the moshi-server crate, installed via cargo install --features cuda moshi-server; the provided start_tts.sh script may also be needed. Requires CUDA.
  • Dependencies: Python 3.x, Rust, Cargo, CUDA (for Rust server), Apple silicon (for MLX).
  • Resources: STT models range from ~1B to 2.6B parameters. An H100 GPU can process 400 concurrent streams; an L40S serves 64 connections at 3x real-time factor (RTF).
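Put together, the PyTorch STT quick start from the bullets above looks like the following (model weights are fetched from the Hugging Face Hub on first run; the Rust server commands require a CUDA-capable GPU):

```shell
# PyTorch STT: install and transcribe a sample file
pip install moshi
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3

# Rust STT server: install the crate and start a worker (requires CUDA)
cargo install --features cuda moshi-server
moshi-server worker --config configs/config-stt-en_fr-hf.toml
```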

Highlighted Details

  • Streaming inference with low latency (e.g., 0.5s delay for 1B STT model).
  • Word-level timestamps for STT.
  • Semantic VAD for STT models.
  • Multiple implementations: PyTorch (research), Rust (production server), MLX (Apple silicon).
  • Prompting capabilities for STT (e.g., influencing spelling, speaker adaptation).

Maintenance & Community

  • A project page and a pre-print paper are available for more details.
  • Links to Colab notebooks for PyTorch implementations.
  • Rust server code is in a separate moshi repository.

Licensing & Compatibility

  • Python code: MIT License.
  • Rust backend: Apache License.
  • Web client code: MIT License.
  • Model weights: CC-BY 4.0 License.
  • Compatible with commercial use, but model weights have attribution requirements.

Limitations & Caveats

  • The prompt feature for STT is experimental and sensitive to input.
  • Rust TTS server installation can be complex; the maintainers suggest opening an issue if it breaks.
  • Required CUDA versions are not explicitly stated, though CUDA is needed for the Rust server.
Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 6
  • Issues (30d): 15
  • Star history: 134 stars in the last 30 days
