voxtral-mini-realtime-rs by TrevorS

Real-time speech recognition in Rust

Created 3 weeks ago

661 stars

Top 50.7% on SourcePulse

Project Summary

This project provides a pure Rust implementation of Mistral's Voxtral Mini 4B Realtime model, enabling streaming speech recognition both natively and entirely within the browser. It targets engineers and power users who want efficient, client-side transcription, reducing reliance on server-side processing and enabling real-time applications across diverse platforms.

How It Works

The system leverages the Burn ML framework for its core operations, processing audio via a Mel spectrogram, followed by a causal encoder and an autoregressive decoder. It offers two inference paths: a full-precision F32 model using SafeTensors weights (~9 GB) for native execution, and a highly optimized Q4 GGUF quantized model (~2.5 GB) that runs efficiently on both native platforms and in the browser via WebAssembly (WASM) and WebGPU. The browser path utilizes custom WGSL shaders for fused dequantization and matrix multiplication, significantly reducing memory footprint and computational overhead.
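To make the fused dequantization concrete, here is a CPU reference sketch of the operation the project's WGSL shaders perform on the GPU, assuming the standard ggml Q4_0 layout (blocks of 32 weights, one per-block scale plus 16 bytes of packed 4-bit quants). The scale is stored as f32 here for simplicity, whereas the real GGUF format uses f16; all names are illustrative, not the project's actual API.

```rust
const QK: usize = 32; // weights per quantization block

struct BlockQ4 {
    scale: f32,       // per-block scale (f16 in the real GGUF format)
    quants: [u8; 16], // 32 packed 4-bit values, two per byte
}

/// Plain dequantization: materializes all 32 f32 weights.
fn dequantize(block: &BlockQ4) -> [f32; QK] {
    let mut out = [0.0f32; QK];
    for i in 0..16 {
        // low nibble -> element i, high nibble -> element i + 16,
        // each offset by -8 to recover a signed value
        let lo = (block.quants[i] & 0x0F) as i32 - 8;
        let hi = (block.quants[i] >> 4) as i32 - 8;
        out[i] = lo as f32 * block.scale;
        out[i + 16] = hi as f32 * block.scale;
    }
    out
}

/// A fused kernel instead accumulates the dot product against an
/// activation vector while dequantizing, never materializing `out`.
fn fused_dot(block: &BlockQ4, x: &[f32; QK]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..16 {
        let lo = (block.quants[i] & 0x0F) as i32 - 8;
        let hi = (block.quants[i] >> 4) as i32 - 8;
        acc += lo as f32 * x[i] + hi as f32 * x[i + 16];
    }
    acc * block.scale
}

fn main() {
    // quants of 0x88 encode (8, 8) -> both nibbles dequantize to 0.0
    let block = BlockQ4 { scale: 0.5, quants: [0x88; 16] };
    assert!(dequantize(&block).iter().all(|&v| v == 0.0));
    assert_eq!(fused_dot(&block, &[1.0f32; QK]), 0.0);
    // 0x9F packs (15, 9) -> (7, 1) after the -8 shift
    let block2 = BlockQ4 { scale: 2.0, quants: [0x9F; 16] };
    assert_eq!(fused_dot(&block2, &[1.0f32; QK]), 256.0);
    println!("ok");
}
```

Fusing the two steps means the quantized weights travel through GPU memory at ~4 bits each and are expanded only in registers, which is where the memory-footprint savings come from.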

Quick Start & Requirements

Native CLI use involves downloading the model weights (mistralai/Voxtral-Mini-4B-Realtime-2602) and compiling the Rust project with the appropriate features (wgpu, cli, hub). Transcription can then be run on audio files with either the F32 model or the Q4 GGUF quantized version. For browser deployment, build the WASM package (wasm-pack), generate a self-signed certificate (openssl), and start a development server (bun serve.mjs); the app is then served at https://localhost:8443. Prerequisites include Rust, uv, wasm-pack, openssl, and bun. WebGPU support is mandatory for browser use. Model weights require ~9 GB (F32) or ~2.5 GB (Q4 GGUF) of disk space.

Highlighted Details

  • Full Client-Side Browser Execution: The Q4 GGUF model (~2.5 GB) runs entirely within a browser tab via WASM and WebGPU, so audio and inference never leave the client.
  • WASM Constraints Solved: Overcame browser limitations including 2 GB allocation, 4 GB address space, large embedding tables (1.5 GiB), synchronous GPU readback, and workgroup invocation limits through techniques like sharded loading, two-phase loading, Q4 GPU embeddings, async data transfer, and patched GPU kernels.
  • Optimized Q4 GGUF Inference: Employs custom WGSL shaders for fused dequantization and matrix multiplication, enabling efficient inference on quantized weights.
  • Q4 Padding Workaround: Addresses quantization sensitivity to initial speech content by increasing audio padding to 76 tokens, ensuring correct decoder prefix handling.
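The sharded loading mentioned above exists because a single allocation in WASM cannot exceed 2 GB, so a ~2.5 GB GGUF file must be fetched and handed to the GPU in smaller pieces. A minimal sketch of the underlying arithmetic, assuming 512 MiB shards and half-open byte ranges (names are illustrative; the project's actual loader may differ):

```rust
// 512 MiB per shard keeps every allocation well under the 2 GB WASM limit.
const SHARD_BYTES: u64 = 512 * 1024 * 1024;

/// Split a file of `total` bytes into half-open (start, end) byte ranges,
/// each at most SHARD_BYTES long, suitable for ranged fetches.
fn shard_ranges(total: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + SHARD_BYTES).min(total);
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // A ~2.5 GB GGUF file splits into five shards: four full, one partial.
    let total: u64 = 2_500_000_000;
    let ranges = shard_ranges(total);
    assert_eq!(ranges.len(), 5);
    assert_eq!(ranges[0], (0, SHARD_BYTES));
    assert_eq!(ranges.last().unwrap().1, total);
    println!("{} shards", ranges.len());
}
```

Each range can then be fetched independently and uploaded to a GPU buffer before the next is requested, so peak host memory stays near one shard rather than the whole file.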

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap are provided within the README.

Licensing & Compatibility

The project is licensed under the Apache-2.0 license, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Performance benchmarks (word error rate, inference speed) are pending. GPU-dependent tests are not run in CI due to runner limitations. Browser deployment requires manually sharding the GGUF file into 512 MB chunks and generating a self-signed certificate to obtain a secure context.

Health Check
  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 3
  • Issues (30d): 3
Star History
665 stars in the last 21 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 3 more.

voxtral.c by antirez

5.3% · 1k stars
Pure C speech-to-text inference engine for Mistral Voxtral Realtime 4B
Created 2 weeks ago
Updated 1 week ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.1% · 6k stars
PyTorch implementation for Google's Gemma models
Created 2 years ago
Updated 9 months ago