hibiki by kyutai-labs

Speech-to-speech translation model for real-time streaming

Created 7 months ago
1,271 stars

Top 31.2% on SourcePulse

Project Summary

Hibiki is a real-time speech-to-speech translation model designed for streaming, enabling chunk-by-chunk translation as a user speaks. It targets researchers and developers working on low-latency translation applications, offering natural speech output in the target language with optional voice transfer.

How It Works

Hibiki employs a decoder-only, multistream architecture to jointly model source and target speech, allowing it to continuously process input while generating output. It produces text and audio tokens at a constant 12.5 Hz framerate, yielding a continuous audio stream with timestamped text translations. Training uses synthetic data and a weakly-supervised alignment method that relies on an off-the-shelf MT system to detect when target words become predictable from the source.
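Because tokens are emitted at a fixed 12.5 Hz, each frame spans 80 ms of audio, so frame indices map directly to stream timestamps. A small illustration of that mapping (not Hibiki's actual code):

```python
# At Hibiki's fixed 12.5 Hz framerate, each generated frame covers
# 1000 / 12.5 = 80 ms of audio, so a token's position in the stream
# determines its timestamp with no extra bookkeeping.
FRAME_RATE_HZ = 12.5
FRAME_MS = 1000.0 / FRAME_RATE_HZ  # 80.0 ms per frame

def frame_to_timestamp(frame_index: int) -> float:
    """Start time in seconds of a given audio/text frame."""
    return frame_index / FRAME_RATE_HZ

# The 125th frame starts exactly 10 s into the stream.
assert frame_to_timestamp(125) == 10.0
```

This fixed framerate is what lets the model interleave timestamped text with a continuous audio stream.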

Quick Start & Requirements

  • PyTorch: pip install -U moshi
  • MLX: pip install -U moshi_mlx
  • Rust: cd hibiki-rs && cargo run --features metal -r -- gen <input_audio> <output_audio> (or --features cuda)
  • Prerequisites: Python 3.x with PyTorch (or MLX on macOS/iOS), or a Rust toolchain, depending on the backend.
  • Models: Available on HuggingFace (e.g., kyutai/hibiki-1b-pytorch-bf16).
  • Demo: Real-time web UI via python -m moshi_mlx.local_web.
  • Docs: https://github.com/kyutai-labs/moshi

Highlighted Details

  • Supports voice transfer with controllable fidelity via Classifier-Free Guidance.
  • Offers models optimized for on-device inference (Hibiki 1B) and higher fidelity (Hibiki 2B).
  • Inference uses simple temperature sampling, which makes it compatible with batching.
  • Experimental MLX-Swift implementation available for iOS.
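Classifier-free guidance generally controls fidelity by blending predictions made with and without the conditioning signal (here, the source voice). A minimal generic sketch of that blending rule, not Hibiki's actual implementation:

```python
def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Generic classifier-free guidance blend.

    guidance_scale = 0 ignores the conditioning entirely,
    guidance_scale = 1 reproduces the conditioned logits, and
    guidance_scale > 1 pushes sampling further toward the
    conditioned (voice-matched) output.
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

# With scale 1 the blend is exactly the conditioned prediction.
assert cfg_logits([2.0, 0.0], [1.0, 1.0], 1.0) == [2.0, 0.0]
```

Tuning the scale is what gives the "controllable fidelity" of the voice transfer: higher values trade naturalness for a closer match to the source speaker.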

Maintenance & Community

The project is associated with Kyutai Labs, with core implementation in the kyutai-labs/moshi repository.

Licensing & Compatibility

  • Code: MIT License (Python), Apache License (Rust).
  • Model Weights: CC-BY 4.0 License. This license allows commercial use and distribution, but requires attribution.

Limitations & Caveats

Currently supports only French-to-English translation. Models are trained on sequences up to 120 seconds with a 40-second context size. The MLX-Swift implementation for iOS is noted as experimental.
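Given the 120-second training limit, recordings longer than that presumably need to be split before inference. A hypothetical pre-chunking helper (not part of the moshi API) sketching one way to plan the windows:

```python
def plan_chunks(total_s: float, max_s: float = 120.0):
    """Split a long recording into consecutive windows no longer
    than the 120 s sequence length the models were trained on.

    Returns a list of (start, end) times in seconds.
    """
    chunks, start = [], 0.0
    while start < total_s:
        end = min(start + max_s, total_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 250 s recording becomes two full windows plus a 10 s remainder.
assert plan_chunks(250.0) == [(0.0, 120.0), (120.0, 240.0), (240.0, 250.0)]
```

A real pipeline would likely cut at silences near the boundary rather than at hard offsets, since the model's 40-second context does not carry across independent chunks.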

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; Research Engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

End-to-end speech processing toolkit for various speech tasks
Top 0.2% on SourcePulse · 9k stars · Created 7 years ago · Updated 3 days ago