hibiki by kyutai-labs

Speech-to-speech translation model for real-time streaming

created 5 months ago
1,248 stars

Top 32.3% on sourcepulse

View on GitHub
Project Summary

Hibiki is a real-time speech-to-speech translation model designed for streaming, enabling chunk-by-chunk translation as a user speaks. It targets researchers and developers working on low-latency translation applications, offering natural speech output in the target language with optional voice transfer.

How It Works

Hibiki employs a decoder-only, multistream architecture that jointly models source and target speech, allowing it to process input and generate output continuously. It produces text and audio tokens at a constant 12.5 Hz frame rate, yielding a continuous audio stream with timestamped text translations. Training uses synthetic data and a weakly supervised alignment method that leverages an off-the-shelf machine-translation system to predict when target words become predictable from the source.
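The constant frame rate makes the relationship between audio duration and generation steps simple arithmetic. A minimal sketch of that conversion (illustrative only; the function names are not from the repo), using the 12.5 Hz rate plus the 40-second context and 120-second training-sequence limits mentioned below:

```python
# Hibiki emits text and audio tokens at a constant 12.5 Hz frame rate.
# Purely illustrative arithmetic, not the repo's API.

FRAME_RATE_HZ = 12.5  # generation steps per second of audio, per stream

def seconds_to_frames(seconds: float) -> int:
    """Number of generation steps needed to cover `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ)

def frames_to_seconds(frames: int) -> float:
    """Audio duration covered by `frames` generation steps."""
    return frames / FRAME_RATE_HZ

# The 40-second context corresponds to 500 frames,
# and the 120-second maximum training sequence to 1500 frames.
print(seconds_to_frames(40))   # 500
print(seconds_to_frames(120))  # 1500
```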

Quick Start & Requirements

  • PyTorch: pip install -U moshi
  • MLX: pip install -U moshi_mlx
  • Rust: cd hibiki-rs && cargo run --features metal -r -- gen <input_audio> <output_audio> (or --features cuda)
  • Prerequisites: Python 3.x, PyTorch, MLX (for macOS/iOS), Rust.
  • Models: Available on HuggingFace (e.g., kyutai/hibiki-1b-pytorch-bf16).
  • Demo: Real-time web UI via python -m moshi_mlx.local_web.
  • Docs: https://github.com/kyutai-labs/moshi

Highlighted Details

  • Supports voice transfer with controllable fidelity via Classifier-Free Guidance.
  • Offers models optimized for on-device inference (Hibiki 1B) and higher fidelity (Hibiki 2B).
  • Inference supports batching, since decoding uses simple temperature sampling.
  • Experimental MLX-Swift implementation available for iOS.

Maintenance & Community

The project is associated with Kyutai Labs, with core implementation in the kyutai-labs/moshi repository.

Licensing & Compatibility

  • Code: MIT License (Python), Apache License (Rust).
  • Model Weights: CC-BY 4.0 License. This license allows commercial use and distribution, but requires attribution.

Limitations & Caveats

Currently supports only French-to-English translation. Models are trained on sequences up to 120 seconds with a 40-second context size. The MLX-Swift implementation for iOS is noted as experimental.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 242 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Lianmin Zheng (Author of SGLang).

fish-speech by fishaudio

  • 0.3% · 23k stars
  • Open-source TTS for multilingual speech synthesis
  • created 1 year ago, updated 1 week ago
  • Starred by Boris Cherny (Creator of Claude Code; MTS at Anthropic), Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 19 more.

whisper by openai

  • 0.4% · 86k stars
  • Speech recognition model for multilingual transcription/translation
  • created 2 years ago, updated 1 month ago