voxtral-mini-realtime-rs by TrevorS

Real-time speech recognition in Rust

Created 3 weeks ago

661 stars

Top 50.7% on SourcePulse

Project Summary

This project provides a pure Rust implementation of Mistral's Voxtral Mini 4B Realtime model, enabling streaming speech recognition both natively and entirely within the browser. It targets engineers and power users who want efficient, client-side transcription, reducing reliance on server-side processing and enabling real-time applications across diverse platforms.

How It Works

The system leverages the Burn ML framework for its core operations, processing audio via a Mel spectrogram, followed by a causal encoder and an autoregressive decoder. It offers two inference paths: a full-precision F32 model using SafeTensors weights (~9 GB) for native execution, and a highly optimized Q4 GGUF quantized model (~2.5 GB) that runs efficiently on both native platforms and in the browser via WebAssembly (WASM) and WebGPU. The browser path utilizes custom WGSL shaders for fused dequantization and matrix multiplication, significantly reducing memory footprint and computational overhead.
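To make the fused dequantization concrete, here is a CPU reference sketch of the operation the project's WGSL shaders perform on the GPU, assuming the standard ggml Q4_0 layout (blocks of 32 weights, one per-block scale plus 16 bytes of packed 4-bit quants). The scale is stored as f32 here for simplicity, whereas the real GGUF format uses f16; all names are illustrative, not the project's actual API.

```rust
const QK: usize = 32; // weights per quantization block

struct BlockQ4 {
    scale: f32,       // per-block scale (f16 in the real GGUF format)
    quants: [u8; 16], // 32 packed 4-bit values, two per byte
}

/// Plain dequantization: materializes all 32 f32 weights.
fn dequantize(block: &BlockQ4) -> [f32; QK] {
    let mut out = [0.0f32; QK];
    for i in 0..16 {
        // low nibble -> element i, high nibble -> element i + 16,
        // each offset by -8 to recover a signed value
        let lo = (block.quants[i] & 0x0F) as i32 - 8;
        let hi = (block.quants[i] >> 4) as i32 - 8;
        out[i] = lo as f32 * block.scale;
        out[i + 16] = hi as f32 * block.scale;
    }
    out
}

/// A fused kernel instead accumulates the dot product against an
/// activation vector while dequantizing, never materializing `out`.
fn fused_dot(block: &BlockQ4, x: &[f32; QK]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..16 {
        let lo = (block.quants[i] & 0x0F) as i32 - 8;
        let hi = (block.quants[i] >> 4) as i32 - 8;
        acc += lo as f32 * x[i] + hi as f32 * x[i + 16];
    }
    acc * block.scale
}

fn main() {
    // quants of 0x88 encode (8, 8) -> both nibbles dequantize to 0.0
    let block = BlockQ4 { scale: 0.5, quants: [0x88; 16] };
    assert!(dequantize(&block).iter().all(|&v| v == 0.0));
    assert_eq!(fused_dot(&block, &[1.0f32; QK]), 0.0);
    // 0x9F packs (15, 9) -> (7, 1) after the -8 shift
    let block2 = BlockQ4 { scale: 2.0, quants: [0x9F; 16] };
    assert_eq!(fused_dot(&block2, &[1.0f32; QK]), 256.0);
    println!("ok");
}
```

Fusing the two steps means the quantized weights travel through GPU memory at ~4 bits each and are expanded only in registers, which is where the memory-footprint savings come from.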

Quick Start & Requirements

Native CLI use involves downloading the model weights (mistralai/Voxtral-Mini-4B-Realtime-2602) and compiling the Rust project with the appropriate features (wgpu, cli, hub). Transcription can then be run on audio files with either the F32 model or the Q4 GGUF quantized version. For browser deployment, build the WASM package (wasm-pack), generate a self-signed certificate (openssl), and start a development server (bun serve.mjs); the app is then served at https://localhost:8443. Prerequisites include Rust, uv, wasm-pack, openssl, and bun. WebGPU support is mandatory for browser use. Model weights require ~9 GB (F32) or ~2.5 GB (Q4 GGUF) of disk space.

Highlighted Details

  • Full Client-Side Browser Execution: The Q4 GGUF model (~2.5 GB) runs entirely within a browser tab via WASM and WebGPU, so audio and inference never leave the client.
  • WASM Constraints Solved: Overcame browser limitations including 2 GB allocation, 4 GB address space, large embedding tables (1.5 GiB), synchronous GPU readback, and workgroup invocation limits through techniques like sharded loading, two-phase loading, Q4 GPU embeddings, async data transfer, and patched GPU kernels.
  • Optimized Q4 GGUF Inference: Employs custom WGSL shaders for fused dequantization and matrix multiplication, enabling efficient inference on quantized weights.
  • Q4 Padding Workaround: Addresses quantization sensitivity to initial speech content by increasing audio padding to 76 tokens, ensuring correct decoder prefix handling.
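The sharded loading mentioned above exists because a single allocation in WASM cannot exceed 2 GB, so a ~2.5 GB GGUF file must be fetched and handed to the GPU in smaller pieces. A minimal sketch of the underlying arithmetic, assuming 512 MiB shards and half-open byte ranges (names are illustrative; the project's actual loader may differ):

```rust
// 512 MiB per shard keeps every allocation well under the 2 GB WASM limit.
const SHARD_BYTES: u64 = 512 * 1024 * 1024;

/// Split a file of `total` bytes into half-open (start, end) byte ranges,
/// each at most SHARD_BYTES long, suitable for ranged fetches.
fn shard_ranges(total: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + SHARD_BYTES).min(total);
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // A ~2.5 GB GGUF file splits into five shards: four full, one partial.
    let total: u64 = 2_500_000_000;
    let ranges = shard_ranges(total);
    assert_eq!(ranges.len(), 5);
    assert_eq!(ranges[0], (0, SHARD_BYTES));
    assert_eq!(ranges.last().unwrap().1, total);
    println!("{} shards", ranges.len());
}
```

Each range can then be fetched independently and uploaded to a GPU buffer before the next is requested, so peak host memory stays near one shard rather than the whole file.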

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap are provided within the README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Performance benchmarks (word error rate, inference speed) are pending. GPU-dependent tests are not run in CI due to runner limitations. Browser deployment requires manually sharding the GGUF file into 512 MB chunks and generating a self-signed certificate to obtain a secure context.

Health Check
  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 3
Star History
665 stars in the last 21 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.

voxtral.c by antirez

5.3% · 1k stars
Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
Created 2 weeks ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.1% · 6k stars
PyTorch implementation for Google's Gemma models
Created 2 years ago
Updated 9 months ago