Speech-to-speech translation model for real-time streaming
Hibiki is a real-time speech-to-speech translation model designed for streaming, enabling chunk-by-chunk translation as a user speaks. It targets researchers and developers working on low-latency translation applications, offering natural speech output in the target language with optional voice transfer.
How It Works
Hibiki employs a decoder-only, multistream architecture to jointly model source and target speech, allowing it to continuously process input while generating output. It produces text and audio tokens at a constant 12.5 Hz frame rate, yielding a continuous target audio stream along with timestamped text translations. Training uses synthetic data and a weakly supervised alignment method that leverages an off-the-shelf machine-translation system to predict when each target word becomes predictable from the source.
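As a mental model of the loop this enables, here is a minimal Python sketch of chunk-by-chunk decoding at the constant frame rate. The `model.step` interface and the 24 kHz sample rate are illustrative assumptions for the sketch, not the actual moshi API; consult kyutai-labs/moshi for real usage.

```python
# Sketch of Hibiki-style chunk-by-chunk streaming translation.
# At the constant 12.5 Hz frame rate, each step consumes one 80 ms frame of
# source audio and jointly emits a text token plus an 80 ms target audio frame.
# `model.step` is an illustrative assumption, NOT the real moshi API.

SAMPLE_RATE = 24_000                                    # assumed codec sample rate
FRAME_RATE_HZ = 12.5
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE_HZ)    # 1920 samples = 80 ms

def translate_stream(model, source_frames):
    """Yield (text_token, target_audio_frame) pairs as source frames arrive."""
    for frame in source_frames:                         # one 80 ms source chunk
        text_token, audio_frame = model.step(frame)     # joint text+audio decode
        yield text_token, audio_frame                   # output begins before input ends
```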
Quick Start & Requirements
- PyTorch: `pip install -U moshi`, then run inference with a Hibiki checkpoint such as `kyutai/hibiki-1b-pytorch-bf16`.
- MLX (Apple silicon): `pip install -U moshi_mlx`; a local web UI can be launched with `python -m moshi_mlx.local_web`.
- Rust: `cd hibiki-rs && cargo run --features metal -r -- gen <input_audio> <output_audio>` (use `--features cuda` instead of `metal` on NVIDIA GPUs).
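To fetch a checkpoint ahead of time (for offline use or inspection), the generic Hugging Face Hub client can download the model repository. This is standard `huggingface_hub` usage, not a Hibiki-specific API; the inference entry point you then use may have its own download mechanism.

```python
# Pre-download the Hibiki PyTorch checkpoint from the Hugging Face Hub.
# Generic huggingface_hub usage, not a Hibiki-specific API.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/hibiki-1b-pytorch-bf16")
print(f"Checkpoint files downloaded to: {local_dir}")
```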
Highlighted Details
Maintenance & Community
The project is associated with Kyutai Labs, with the core implementation in the kyutai-labs/moshi repository. The last recorded update was 3 months ago, and the repository is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats
Hibiki currently supports only French-to-English translation. Models are trained on sequences up to 120 seconds, with a 40-second context window. The MLX-Swift implementation for iOS is noted as experimental.
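At the 12.5 Hz token frame rate described above, these limits correspond to fixed frame counts; a quick sanity check:

```python
# Frame counts implied by the stated limits, at the 12.5 Hz token frame rate.
FRAME_RATE_HZ = 12.5

frame_duration_ms = 1000 / FRAME_RATE_HZ      # 80.0 ms per frame
max_seq_frames = int(120 * FRAME_RATE_HZ)     # 120 s sequence -> 1500 frames
context_frames = int(40 * FRAME_RATE_HZ)      # 40 s context   -> 500 frames

print(frame_duration_ms, max_seq_frames, context_frames)  # 80.0 1500 500
```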