Minimal LLM inference in Rust
This project provides a minimal, CPU-only inference engine for large language models (LLMs) written in Rust. It targets developers and researchers who want to run LLMs locally without heavy ML dependencies, offering support for Gemma, Llama 3.2, and Phi-3.5 (including multimodal capabilities), with quantized models for improved performance.
How It Works
The engine implements LLM inference directly in Rust, avoiding external ML libraries like PyTorch or TensorFlow. It uses custom model conversion scripts to transform Hugging Face models into its own `.lmrs` format, supporting various quantization levels (e.g., Q8_0, Q4_0) for a reduced memory footprint and faster inference. The core design prioritizes minimal dependencies and direct CPU execution, inspired by projects like llama2.c.
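To illustrate the kind of group-wise 8-bit quantization a format in the spirit of Q8_0 uses, here is a minimal sketch. The group size, struct layout, and function names are assumptions chosen for the example, not the actual `.lmrs` layout:

```rust
/// One quantized group: GROUP_SIZE signed 8-bit values sharing a single f32 scale.
/// (Illustrative only; the real .lmrs layout may differ.)
const GROUP_SIZE: usize = 32;

struct QuantGroup {
    scale: f32,
    values: [i8; GROUP_SIZE],
}

/// Quantize a slice of f32 weights into groups of int8 values.
fn quantize_q8(weights: &[f32]) -> Vec<QuantGroup> {
    weights
        .chunks(GROUP_SIZE)
        .map(|chunk| {
            // Pick a scale so the largest magnitude in the group maps to 127.
            let max_abs = chunk.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let mut values = [0i8; GROUP_SIZE];
            for (v, &w) in values.iter_mut().zip(chunk) {
                *v = (w / scale).round() as i8;
            }
            QuantGroup { scale, values }
        })
        .collect()
}

/// Dequantize back to f32 (in a real engine this happens on the fly inside matmuls).
fn dequantize(groups: &[QuantGroup]) -> Vec<f32> {
    groups
        .iter()
        .flat_map(|g| {
            let scale = g.scale;
            g.values.iter().map(move |&v| v as f32 * scale)
        })
        .collect()
}

fn main() {
    let weights = vec![0.12, -0.5, 0.03, 0.9, -1.2, 0.0, 0.7, -0.33];
    let quantized = quantize_q8(&weights);
    let restored = dequantize(&quantized);
    // Reconstruction error stays within one quantization step per group.
    for (orig, deq) in weights.iter().zip(&restored) {
        println!("{orig:>6.3} -> {deq:>6.3}");
    }
}
```

At this kind of 8-bit precision, each weight costs one byte plus a per-group scale, a little over a quarter of the f32 footprint, which is where the reduced memory use and faster CPU inference come from.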
Quick Start & Requirements
pip install -r requirements.txt
python export.py
and python tokenizer.py
.RUSTFLAGS="-C target-cpu=native" cargo build --release [--features multimodal]
./target/release/chat --model [model weights file]
--features backend
and run ./target/release/backend
..safetensors
, config.json
, CLIP config for vision).Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is presented as an experimental learning exercise by the author, with some code potentially requiring optimization. Support for larger models (e.g., 27B) is noted as too slow for practical use on the author's hardware.