voxtral.c by antirez

Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B

Created 4 days ago

423 stars

Top 69.8% on SourcePulse

4 Experts Love This Project

Patrick von Platen

Author of Hugging Face Diffusers; Research Engineer at Mistral

Luis Capelo

Cofounder of Lightning AI

Phil Wang

Prolific Research Paper Implementer

Simon Willison

Coauthor of Django

Project Summary

voxtral.c is a pure C inference engine for Mistral AI's Voxtral Realtime 4B speech-to-text model. It targets developers who need a lightweight, dependency-free ASR solution for embedded systems or performance-critical applications, and it enables real-time streaming transcription without Python or heavy ML frameworks.

How It Works

The core is a C implementation of the Voxtral Realtime 4B pipeline. On Apple Silicon it needs nothing beyond the C standard library and uses the MPS backend for Metal GPU acceleration; other platforms use a BLAS library such as OpenBLAS. It employs a chunked audio encoder with overlapping windows and a rolling KV cache, so memory use stays bounded for audio input of unlimited length. A streaming C API (vox_stream_t) supports feeding audio incrementally and retrieving token strings as they are produced, including audio piped directly from tools like ffmpeg.
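
To illustrate how a rolling cache can cap memory for arbitrarily long input, here is a minimal sketch using a ring buffer. It is not taken from voxtral.c: the type rolling_cache and all function names are hypothetical, and the project may well use a different strategy (for example, compacting the cache instead of wrapping around). The only idea shown is that once the window is full, each new position overwrites the oldest one, so the allocation never grows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical rolling cache: keeps at most `window` positions,
 * each holding `dim` floats (e.g. one key or value vector). */
typedef struct {
    float *data;    /* window * dim floats, allocated once */
    size_t window;  /* maximum number of positions retained */
    size_t dim;     /* floats per position */
    size_t total;   /* positions appended so far (never decreases) */
} rolling_cache;

/* Returns 0 on success, -1 if allocation fails. */
int rolling_cache_init(rolling_cache *c, size_t window, size_t dim) {
    assert(c != NULL && window > 0 && dim > 0);
    c->data = malloc(window * dim * sizeof(float));
    c->window = window;
    c->dim = dim;
    c->total = 0;
    return c->data ? 0 : -1;
}

/* Append one position; overwrites the oldest once the window is full. */
void rolling_cache_append(rolling_cache *c, const float *vec) {
    size_t slot = c->total % c->window;
    memcpy(c->data + slot * c->dim, vec, c->dim * sizeof(float));
    c->total++;
}

/* Number of positions currently retained (at most `window`). */
size_t rolling_cache_len(const rolling_cache *c) {
    return c->total < c->window ? c->total : c->window;
}

/* Access retained position i, where i = 0 is the oldest one kept. */
const float *rolling_cache_get(const rolling_cache *c, size_t i) {
    size_t len = rolling_cache_len(c);
    assert(i < len);
    size_t first = c->total - len; /* absolute index of oldest entry */
    return c->data + ((first + i) % c->window) * c->dim;
}

void rolling_cache_free(rolling_cache *c) {
    free(c->data);
    c->data = NULL;
}
```

With a window of 3 and five appended positions, the cache retains only positions 2, 3, and 4. The trade-off of any such scheme is that attention can no longer see anything older than the window.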

Quick Start & Requirements

Build: make mps (Apple Silicon) or make blas (Linux/Intel Mac with OpenBLAS).
Model Download: Execute ./download_model.sh (~8.9GB).
Transcription: Run ./voxtral -d voxtral-model -i audio.wav or pipe audio via ffmpeg.
Prerequisites: Standard C library. BLAS (e.g., OpenBLAS) required for non-MPS builds. Metal GPU for MPS backend.
Python Reference: pip install torch safetensors soundfile soxr installs the dependencies of the Python reference code, which is useful for understanding the model logic.
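
When audio is piped in from a tool like ffmpeg, the receiving program typically has to turn raw integer samples into floats before feature extraction. The sketch below shows that conversion under a loudly stated assumption: the input is raw signed 16-bit mono PCM. This summary does not state which format voxtral expects on stdin, and pcm16_to_float is a hypothetical helper, not a function from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert signed 16-bit PCM samples to floats
 * in the range [-1.0, 1.0). Assumes mono audio already decoded to
 * native-endian int16_t; returns the number of samples written. */
size_t pcm16_to_float(const int16_t *in, size_t n, float *out) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)in[i] / 32768.0f;
    }
    return n;
}
```

In a streaming setting, a loop would read a fixed-size block of samples from stdin, convert it this way, and hand the floats to the streaming API before reading the next block.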

Highlighted Details

Zero Dependencies: Pure C inference engine requires only the standard C library for MPS builds.
Metal GPU Acceleration: Optimized for Apple Silicon Macs, offering fast inference with custom GPU kernels.
Streaming C API: Provides vox_stream_t for incremental audio input and token output, suitable for real-time applications.
Memory Efficiency: Utilizes memory-mapped weights for fast loading and a rolling KV cache to cap memory usage.
Flexible Input: Supports direct audio file transcription or piping any format via ffmpeg to stdin.
Alternative Tokens: Option to display competing token candidates when the model is uncertain.
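
The memory-mapped weight loading mentioned above can be sketched with POSIX mmap. This is an illustration under stated assumptions, not the project's loader: mapped_file, map_file_readonly, and unmap_file are hypothetical names, and the sketch ignores any file-format parsing. It shows why loading is fast: the call returns as soon as the mapping is set up, and the OS pages weight data in from disk only when it is first touched.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only mapping of a weights file. */
typedef struct {
    const void *base; /* start of mapped bytes, NULL if unmapped */
    size_t size;      /* length of the mapping in bytes */
} mapped_file;

/* Map `path` read-only. Returns 0 on success, -1 on any failure. */
int map_file_readonly(const char *path, mapped_file *out) {
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;

    out->base = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and reset the handle. */
void unmap_file(mapped_file *m) {
    if (m->base != NULL) munmap((void *)m->base, m->size);
    m->base = NULL;
    m->size = 0;
}
```

Because the mapping is read-only and file-backed, the OS can evict and re-read pages under memory pressure instead of swapping them out.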

Maintenance & Community

No specific details on maintainers, community channels, or roadmap were found in the provided README.

Licensing & Compatibility

The model weights are licensed under Apache-2.0. The C code itself is provided under the MIT License, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The README states that the project needs "more testing" and may not be "production quality." In particular, very long transcriptions still need stress-testing to validate the KV cache handling.

Health Check

Last Commit

2 days ago

Responsiveness

Inactive

Star History

463 stars in the last 4 days