Rust library for fast Llama2 inference on CPU
This project provides a high-performance Llama 2 inference engine implemented entirely in Rust, targeting CPU execution. It aims to offer speed comparable to or exceeding GPU-based solutions for researchers and power users needing efficient local LLM deployment.
How It Works
The engine leverages several key optimizations for CPU inference: 4-bit GPT-Q quantization for reduced memory and computation, batched prefill for prompt processing, and SIMD instructions for parallel computation. It also incorporates Grouped Query Attention (GQA) for larger models and memory mapping to load massive models like Llama 70B instantly.
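As a concrete illustration of the quantization step, here is a minimal, hypothetical sketch of dequantizing one group of 4-bit weights with a per-group scale and zero point (the group size of 64 mirrors the `group_64` feature flag); it is not the crate's actual code or data layout.

```rust
// Hypothetical GPT-Q-style 4-bit dequantization sketch; not this crate's
// actual code or on-disk layout. Each group of 64 weights shares one f32
// scale and one zero point, and two 4-bit weights are packed per byte.
const GROUP_SIZE: usize = 64;

fn dequantize_group(packed: &[u8; GROUP_SIZE / 2], scale: f32, zero: i32) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (i, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32; // low nibble: first weight of the pair
        let hi = (byte >> 4) as i32;   // high nibble: second weight of the pair
        out[2 * i] = scale * (lo - zero) as f32;
        out[2 * i + 1] = scale * (hi - zero) as f32;
    }
    out
}

fn main() {
    // Pack the 4-bit values 1 and 2 into every byte, then dequantize.
    let packed = [0x21u8; GROUP_SIZE / 2];
    let weights = dequantize_group(&packed, 0.1, 0);
    println!("{:?}", &weights[..4]); // ~[0.1, 0.2, 0.1, 0.2]
}
```

Storing weights as 4-bit integers plus per-group metadata cuts memory to roughly a quarter of f16, which is what lets 70B-class models fit in CPU RAM.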
Quick Start & Requirements
- Requires the nightly Rust toolchain (`rustup toolchain install nightly`).
- Run with `cargo run --release --features 70B,group_64,quantized -- ...`
- Key dependencies: `memmap2` (sketched below), `rayon`, `clap`, `pyo3` (for the Python API), and `portable_simd`.
- Export model weights with the `export.py` script (requires `pip install -r requirements.export.txt`).
- For the Python API, build with `--features python` and install via `pip install .`.
- You may need `ulimit -s 10000000` (increase the stack memory limit).
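The `memmap2` dependency is what makes the "instant" loading of massive checkpoints possible: the weight file is mapped rather than read, so the OS pages data in only as layers are touched. Below is a minimal sketch under assumed names (the file name and flat f32 layout are made up, and a real loader would parse a header first); it is not the project's actual loading code.

```rust
// Minimal memory-mapping sketch using the memmap2 crate; the file name and
// flat f32 layout are assumptions, not this project's real checkpoint format.
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("llama2-70b.bin")?; // hypothetical weight file
    // Safety: the file must not be truncated or modified while it is mapped.
    let mmap = unsafe { Mmap::map(&file)? };

    // View the mapped bytes as f32s; a real checkpoint would first parse a
    // header describing tensor shapes and offsets.
    let (_, floats, _) = unsafe { mmap.align_to::<f32>() };
    println!("mapped {} f32 values without an upfront copy", floats.len());
    Ok(())
}
```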
Highlighted Details
- Python API exposed via `pyo3`.

Maintenance & Community
- See also: `text-generation-webui`.

Licensing & Compatibility
Limitations & Caveats
The project requires a nightly Rust toolchain and recompilation for each model configuration. GPU acceleration via Triton is listed as a future TODO, so operation is currently CPU-only, and safetensors support is likewise pending.