samuel-vitorino: Minimal LLM inference in Rust
Top 36.8% on SourcePulse
This project provides a minimal, CPU-only inference engine for large language models (LLMs) written in Rust. It targets developers and researchers who want to run LLMs locally without heavy ML dependencies, and it supports Gemma, Llama 3.2, and PHI-3.5 (including multimodal capabilities), with quantized models for a smaller memory footprint and faster inference.
How It Works
The engine implements LLM inference directly in Rust, avoiding external ML libraries like PyTorch or TensorFlow. It leverages custom model conversion scripts to transform Hugging Face models into its own .lmrs format, supporting various quantization levels (e.g., Q8_0, Q4_0) for reduced memory footprint and faster inference. The core design prioritizes minimal dependencies and direct CPU execution, inspired by projects like llama2.c.
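The Q8_0 and Q4_0 names refer to the common group-quantization idea of storing a per-block scale alongside low-bit integer weights. The sketch below illustrates that idea for the 8-bit case in Rust; the block size, the QuantBlock struct, and the function names are assumptions made for illustration and do not describe the actual .lmrs on-disk layout.

```rust
/// Illustrative block size; the real .lmrs format may group weights differently.
const GROUP_SIZE: usize = 32;

/// One quantized group: a single f32 scale plus GROUP_SIZE signed 8-bit values.
struct QuantBlock {
    scale: f32,
    values: [i8; GROUP_SIZE],
}

/// Quantize f32 weights into Q8_0-style blocks: within each block, the value
/// with the largest magnitude is mapped to 127 and everything else is scaled
/// proportionally.
fn quantize_q8(weights: &[f32]) -> Vec<QuantBlock> {
    weights
        .chunks(GROUP_SIZE)
        .map(|chunk| {
            let max_abs = chunk.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
            let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
            let mut values = [0i8; GROUP_SIZE];
            for (v, &w) in values.iter_mut().zip(chunk) {
                *v = (w / scale).round() as i8;
            }
            QuantBlock { scale, values }
        })
        .collect()
}

/// Dot product between a quantized weight row and an f32 activation vector,
/// dequantizing on the fly. Loops like this are the hot path a CPU-only
/// engine spends most of its time in.
fn dot_quantized(row: &[QuantBlock], x: &[f32]) -> f32 {
    row.iter()
        .zip(x.chunks(GROUP_SIZE))
        .map(|(block, xs)| {
            let partial: f32 = block
                .values
                .iter()
                .zip(xs)
                .map(|(&q, &xv)| q as f32 * xv)
                .sum();
            block.scale * partial
        })
        .sum()
}

fn main() {
    let weights: Vec<f32> = (0..64).map(|i| (i as f32 - 32.0) / 10.0).collect();
    let activations = vec![0.5f32; 64];
    let blocks = quantize_q8(&weights);
    println!("quantized dot = {}", dot_quantized(&blocks, &activations));
}
```

Storing a scale per small block (rather than per tensor) keeps quantization error local, which is why formats in this family trade a little extra storage for noticeably better accuracy at 8-bit and 4-bit widths.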
Quick Start & Requirements
The quick start involves a Python conversion step followed by a Rust build:
- Install the conversion dependencies: pip install -r requirements.txt
- Convert the model weights and tokenizer to the .lmrs format: python export.py and python tokenizer.py
- Build the binary: RUSTFLAGS="-C target-cpu=native" cargo build --release [--features multimodal]
- Run the chat interface: ./target/release/chat --model [model weights file]
- Optionally, compile with --features backend and run ./target/release/backend
- Conversion expects the original model files (.safetensors, config.json, and the CLIP config for vision models)
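If you prefer to drive the compiled chat binary from another Rust program instead of a terminal, a minimal sketch using std::process::Command is shown below. The weights filename is a hypothetical placeholder, and the only flag assumed is --model, taken from the quick-start command above.

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn the compiled chat binary with a converted .lmrs weights file.
    // "model.lmrs" is a hypothetical placeholder; substitute the file
    // produced by export.py.
    let status = Command::new("./target/release/chat")
        .args(["--model", "model.lmrs"])
        .status()?;

    println!("chat exited with status: {status}");
    Ok(())
}
```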
Highlighted Details
Maintenance & Community
The repository's last recorded activity was about a year ago, and the project is currently listed as inactive.
Licensing & Compatibility
Limitations & Caveats
The author presents the project as an experimental learning exercise, and some of the code may still need optimization. Support for larger models (e.g., 27B) is noted as too slow for practical use on the author's hardware.