voxtral.c  by antirez

Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B

Created 4 days ago

423 stars

Top 69.8% on SourcePulse

Project Summary

voxtral.c is a pure C inference engine for Mistral AI's Voxtral Realtime 4B speech-to-text model. It targets developers who need a lightweight, dependency-free ASR solution for embedded systems or performance-critical applications, and it enables real-time streaming transcription without Python or heavy ML frameworks.

How It Works

The core is a C implementation of the Voxtral Realtime 4B pipeline. On Apple Silicon it depends only on the C standard library and uses the MPS (Metal GPU) backend for acceleration; on other platforms it uses OpenBLAS. It employs a chunked audio encoder with overlapping windows and a rolling KV cache, which bounds memory use so that audio of unlimited length can be processed. A streaming C API (vox_stream_t) accepts audio incrementally and returns token strings as they become available, and it supports direct piping from tools like ffmpeg.
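
To make the rolling KV cache idea concrete, here is a minimal sketch in C of a fixed-capacity ring buffer that keeps only the most recent positions, so memory stays constant no matter how much audio is fed in. Every name here (rolling_cache, rolling_cache_append, and so on) is hypothetical, and the layout is an assumption made for illustration. It is not taken from voxtral.c, which may implement its cache differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-capacity cache: holds at most `window` positions of
 * `dim` floats each. Illustrative only; not the layout used by voxtral.c. */
typedef struct {
    float *data;    /* window * dim floats */
    size_t window;  /* maximum number of positions kept */
    size_t dim;     /* floats per position */
    size_t count;   /* positions currently stored (<= window) */
    size_t next;    /* slot that the next append writes to */
} rolling_cache;

/* Allocate the buffer once; returns 0 on success, -1 on allocation failure. */
int rolling_cache_init(rolling_cache *c, size_t window, size_t dim) {
    assert(window > 0 && dim > 0);
    c->data = calloc(window * dim, sizeof(float));
    if (c->data == NULL) return -1;
    c->window = window;
    c->dim = dim;
    c->count = 0;
    c->next = 0;
    return 0;
}

/* Store one position. Once the cache is full, the oldest slot is overwritten,
 * so memory use never grows beyond window * dim floats. */
void rolling_cache_append(rolling_cache *c, const float *vec) {
    memcpy(c->data + c->next * c->dim, vec, c->dim * sizeof(float));
    c->next = (c->next + 1) % c->window;
    if (c->count < c->window) c->count++;
}

/* Return position i in age order: index 0 is the oldest position retained. */
const float *rolling_cache_get(const rolling_cache *c, size_t i) {
    assert(i < c->count);
    size_t oldest = (c->count < c->window) ? 0 : c->next;
    return c->data + ((oldest + i) % c->window) * c->dim;
}

void rolling_cache_free(rolling_cache *c) {
    free(c->data);
    c->data = NULL;
}
```

With a window of 3, appending five positions leaves positions 2, 3, and 4 in the cache; older ones are overwritten in place rather than triggering a reallocation.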

Quick Start & Requirements

  • Build: make mps (Apple Silicon) or make blas (Linux/Intel Mac with OpenBLAS).
  • Model Download: Execute ./download_model.sh (~8.9GB).
  • Transcription: Run ./voxtral -d voxtral-model -i audio.wav or pipe audio via ffmpeg.
  • Prerequisites: Standard C library; a BLAS implementation (e.g., OpenBLAS) for non-MPS builds; a Metal-capable GPU for the MPS backend.
  • Python Reference (optional): pip install torch safetensors soundfile soxr installs the packages used by the Python reference, which is useful for understanding the model logic.
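
The transcription step mentions piping audio via ffmpeg. As a rough illustration of what consuming such a stream involves, the sketch below reads raw samples from a FILE (for example stdin) and converts them to floats. It assumes, purely for illustration, that the stream is raw signed 16-bit mono PCM in the host's byte order; the format the voxtral binary actually expects on stdin is not stated here. The function names pcm_s16_to_float and read_pcm_chunk are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Convert n signed 16-bit samples to floats in the range [-1, 1). */
void pcm_s16_to_float(const int16_t *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)in[i] / 32768.0f;
    }
}

/* Read up to max_samples samples from f into out, converting as we go.
 * Assumption: the stream is raw 16-bit PCM in host byte order.
 * Returns the number of samples actually read (less than max_samples at EOF). */
size_t read_pcm_chunk(FILE *f, float *out, size_t max_samples) {
    int16_t buf[1024];
    size_t total = 0;
    assert(f != NULL && out != NULL);
    while (total < max_samples) {
        size_t want = max_samples - total;
        if (want > 1024) want = 1024;
        size_t got = fread(buf, sizeof(int16_t), want, f);
        if (got == 0) break;  /* end of stream or read error */
        pcm_s16_to_float(buf, out + total, got);
        total += got;
    }
    return total;
}
```

A caller could loop on read_pcm_chunk(stdin, samples, n) and hand each chunk to a streaming transcriber until the function returns 0.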

Highlighted Details

  • Zero Dependencies: The pure C inference engine requires only the standard C library for MPS builds; non-MPS builds additionally need BLAS.
  • Metal GPU Acceleration: Optimized for Apple Silicon Macs, offering fast inference with custom GPU kernels.
  • Streaming C API: Provides vox_stream_t for incremental audio input and token output, suitable for real-time applications.
  • Memory Efficiency: Utilizes memory-mapped weights for fast loading and a rolling KV cache to cap memory usage.
  • Flexible Input: Supports direct audio file transcription or piping any format via ffmpeg to stdin.
  • Alternative Tokens: Option to display competing token candidates when the model is uncertain.
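
The memory-mapped weights mentioned in the list above can be illustrated with a short POSIX sketch. It maps a file read-only, so the operating system loads pages lazily instead of the program copying the whole file into heap memory up front. The names mapped_file, map_file_readonly, and unmap_file are hypothetical; this is a generic pattern, not the loader used by voxtral.c.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of a weights file. */
typedef struct {
    const void *base;  /* start of the mapped bytes */
    size_t size;       /* length of the mapping in bytes */
} mapped_file;

/* Map the whole file at `path` read-only. Returns 0 on success, -1 on error. */
int map_file_readonly(const char *path, mapped_file *out) {
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;
    out->base = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and reset the handle. */
void unmap_file(mapped_file *m) {
    if (m->base != NULL) munmap((void *)m->base, m->size);
    m->base = NULL;
    m->size = 0;
}
```

Because the mapping is page-aligned, a tensor stored at a suitably aligned offset can be read directly through a pointer such as (const float *)m.base without an intermediate copy.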

Maintenance & Community

No specific details on maintainers, community channels, or roadmap were found in the provided README.

Licensing & Compatibility

The model weights are licensed under Apache-2.0. The C code itself is provided under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The README states that the project needs "more testing" and may not be "production quality." In particular, very long transcriptions still need stress-testing to validate the KV cache handling.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 1
  • Star History: 463 stars in the last 4 days
