hibiki by kyutai-labs

Speech-to-speech translation model for real-time streaming

Created 7 months ago
1,271 stars

Top 31.2% on SourcePulse

Project Summary

Hibiki is a real-time speech-to-speech translation model designed for streaming, enabling chunk-by-chunk translation as a user speaks. It targets researchers and developers working on low-latency translation applications, offering natural speech output in the target language with optional voice transfer.

How It Works

Hibiki employs a decoder-only, multistream architecture to jointly model source and target speech, allowing it to continuously process input while generating output. It produces text and audio tokens at a constant 12.5 Hz framerate, yielding a continuous audio stream with timestamped text translations. Training uses synthetic data and a weakly-supervised alignment method that relies on an off-the-shelf MT system to detect when target words become predictable from the source.
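Because tokens are emitted at a fixed 12.5 Hz, each frame spans 80 ms of audio, so frame indices map directly to stream timestamps. A small illustration of that mapping (not Hibiki's actual code):

```python
# At Hibiki's fixed 12.5 Hz framerate, each generated frame covers
# 1000 / 12.5 = 80 ms of audio, so a token's position in the stream
# determines its timestamp with no extra bookkeeping.
FRAME_RATE_HZ = 12.5
FRAME_MS = 1000.0 / FRAME_RATE_HZ  # 80.0 ms per frame

def frame_to_timestamp(frame_index: int) -> float:
    """Start time in seconds of a given audio/text frame."""
    return frame_index / FRAME_RATE_HZ

# The 125th frame starts exactly 10 s into the stream.
assert frame_to_timestamp(125) == 10.0
```

This fixed framerate is what lets the model interleave timestamped text with a continuous audio stream.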

Quick Start & Requirements

  • PyTorch: pip install -U moshi
  • MLX: pip install -U moshi_mlx
  • Rust: cd hibiki-rs && cargo run --features metal -r -- gen <input_audio> <output_audio> (or --features cuda)
  • Prerequisites: Python 3.x with PyTorch (or MLX on macOS/iOS), or a Rust toolchain, depending on the backend.
  • Models: Available on HuggingFace (e.g., kyutai/hibiki-1b-pytorch-bf16).
  • Demo: Real-time web UI via python -m moshi_mlx.local_web.
  • Docs: https://github.com/kyutai-labs/moshi

Highlighted Details

  • Supports voice transfer with controllable fidelity via Classifier-Free Guidance.
  • Offers models optimized for on-device inference (Hibiki 1B) and higher fidelity (Hibiki 2B).
  • Inference uses simple temperature sampling, which makes it compatible with batching.
  • Experimental MLX-Swift implementation available for iOS.
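Classifier-free guidance generally controls fidelity by blending predictions made with and without the conditioning signal (here, the source voice). A minimal generic sketch of that blending rule, not Hibiki's actual implementation:

```python
def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Generic classifier-free guidance blend.

    guidance_scale = 0 ignores the conditioning entirely,
    guidance_scale = 1 reproduces the conditioned logits, and
    guidance_scale > 1 pushes sampling further toward the
    conditioned (voice-matched) output.
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

# With scale 1 the blend is exactly the conditioned prediction.
assert cfg_logits([2.0, 0.0], [1.0, 1.0], 1.0) == [2.0, 0.0]
```

Tuning the scale is what gives the "controllable fidelity" of the voice transfer: higher values trade naturalness for a closer match to the source speaker.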

Maintenance & Community

The project is associated with Kyutai Labs, with core implementation in the kyutai-labs/moshi repository.

Licensing & Compatibility

  • Code: MIT License (Python), Apache License (Rust).
  • Model Weights: CC-BY 4.0 License. This license allows commercial use and distribution, but requires attribution.

Limitations & Caveats

Currently supports only French-to-English translation. Models are trained on sequences up to 120 seconds with a 40-second context size. The MLX-Swift implementation for iOS is noted as experimental.
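Given the 120-second training limit, recordings longer than that presumably need to be split before inference. A hypothetical pre-chunking helper (not part of the moshi API) sketching one way to plan the windows:

```python
def plan_chunks(total_s: float, max_s: float = 120.0):
    """Split a long recording into consecutive windows no longer
    than the 120 s sequence length the models were trained on.

    Returns a list of (start, end) times in seconds.
    """
    chunks, start = [], 0.0
    while start < total_s:
        end = min(start + max_s, total_s)
        chunks.append((start, end))
        start = end
    return chunks

# A 250 s recording becomes two full windows plus a 10 s remainder.
assert plan_chunks(250.0) == [(0.0, 120.0), (120.0, 240.0), (240.0, 250.0)]
```

A real pipeline would likely cut at silences near the boundary rather than at hard offsets, since the model's 40-second context does not carry across independent chunks.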

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Patrick von Platen (author of Hugging Face Diffusers; Research Engineer at Mistral), Benjamin Bolte (cofounder of K-Scale Labs), and 3 more.

espnet by espnet

End-to-end speech processing toolkit for various speech tasks
Top 0.2% on SourcePulse · 9k stars · Created 7 years ago · Updated 3 days ago