llama2.rs by srush

Rust library for fast Llama2 inference on CPU

Created 2 years ago
1,052 stars

Top 35.8% on SourcePulse

Project Summary

This project provides a high-performance Llama 2 inference engine implemented entirely in Rust and targeting CPU execution. It aims to offer speed comparable to, or exceeding, GPU-based solutions, and is aimed at researchers and power users who need efficient local LLM deployment.

How It Works

The engine leverages several key optimizations for CPU inference: 4-bit GPT-Q quantization to cut memory use and compute, batched prefill for prompt processing, and SIMD instructions for data-parallel arithmetic. It also supports Grouped Query Attention (GQA), used by the larger models, and memory-maps checkpoints so that even massive models like the 70B Llama 2 load instantly.
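A toy sketch of how the quantization and SIMD pieces fit together, assuming a recent nightly toolchain for std::simd; the layout and function here are illustrative, not the project's actual kernel:

```rust
// Toy example: combines 4-bit dequantization with std::simd,
// which is why nightly Rust is required.
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Dot product against one group of 4-bit quantized weights.
/// `packed` holds eight 4-bit weights per u32; `scale` and `zero`
/// are the group's GPT-Q-style quantization parameters.
fn quantized_dot(packed: &[u32], scale: f32, zero: f32, x: &[f32]) -> f32 {
    assert!(x.len() >= packed.len() * 8);
    let mut acc = f32x8::splat(0.0);
    for (i, &word) in packed.iter().enumerate() {
        // Unpack eight 4-bit integers from one u32 into SIMD lanes.
        let w = f32x8::from_array(std::array::from_fn(|j| {
            ((word >> (4 * j)) & 0xF) as f32
        }));
        let xs = f32x8::from_slice(&x[i * 8..i * 8 + 8]);
        // Dequantize as (w - zero) * scale and accumulate 8 lanes at once.
        acc += (w - f32x8::splat(zero)) * f32x8::splat(scale) * xs;
    }
    acc.reduce_sum()
}

fn main() {
    // 32 weights all equal to 7; with zero = 7, every dequantized weight is 0.
    let packed = vec![0x7777_7777u32; 4];
    let x = vec![1.0f32; 32];
    assert_eq!(quantized_dot(&packed, 0.5, 7.0, &x), 0.0);
}
```

Real GPT-Q kernels keep the weights packed in memory and dequantize only inside the inner loop; since CPU inference is dominated by memory bandwidth, moving 4-bit rather than 16-bit weights is what makes this both smaller and faster.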

Quick Start & Requirements

  • Install: Requires nightly Rust toolchain (rustup toolchain install nightly).
  • Build: Compile with specific features matching the model (e.g., cargo run --release --features 70B,group_64,quantized -- ...).
  • Dependencies: memmap2, rayon, clap, and pyo3 (for the Python API); SIMD comes from nightly Rust's portable_simd (std::simd), hence the nightly requirement.
  • Model Loading: Models can be loaded from Hugging Face Hub using an export.py script (requires pip install -r requirements.export.txt).
  • Python API: Compile with --features python and install via pip install . (a binding sketch follows this list).
  • Resource: run ulimit -s 10000000 to raise the stack size limit before launching.
  • Docs: [No explicit link provided in README]
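As a sketch of how that Python API is wired up: pyo3 exposes Rust functions as an importable extension module. The names here (llama2_rs, generate) are hypothetical, and the signatures assume a pyo3 0.19-era API rather than the project's actual interface:

```rust
use pyo3::prelude::*;

/// Hypothetical generation entry point; the real library's
/// Python-facing names and arguments may differ.
#[pyfunction]
fn generate(prompt: &str, steps: usize) -> PyResult<String> {
    // The real implementation would run the quantized transformer;
    // echoing the prompt keeps this sketch self-contained.
    Ok(format!("{prompt} ... ({steps} steps)"))
}

/// After `pip install .`, Python code can `import llama2_rs` (name assumed).
#[pymodule]
fn llama2_rs(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(generate, m)?)?;
    Ok(())
}
```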

Highlighted Details

  • Achieves ~0.89 tok/s for 70B Llama 2 and ~9 tok/s for 7B Llama 2 on an Intel i9.
  • Supports 4-bit GPT-Q quantization and Grouped Query Attention.
  • Memory mapping allows instant loading of large models (see the sketch after this list).
  • Offers a Python calling API via pyo3.
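A minimal sketch of the memory-mapping point above, using the memmap2 crate already listed under dependencies (the file name is hypothetical): mapping returns immediately and the OS pages weights in lazily on first access, which is why even a 70B checkpoint appears to load instantly.

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Hypothetical path; the project exports checkpoints in its own format.
    let file = File::open("llama2-70b.bin")?;
    // Safety: the file must not be truncated or modified while mapped.
    let weights: Mmap = unsafe { Mmap::map(&file)? };
    // Returns immediately; no bytes have been read from disk yet.
    println!("mapped {} bytes", weights.len());
    // Touching data faults in only the pages actually accessed.
    let (_, floats, _) = unsafe { weights[..4096].align_to::<f32>() };
    println!("first weight: {:?}", floats.first());
    Ok(())
}
```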

Maintenance & Community

  • Developed by @srush and @rachtsingh.
  • Mentions potential integration with text-generation-webui.
  • TODO list includes GPU support (Triton), documentation, and safetensors support.
  • [No community links like Discord/Slack provided in README]

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

The project requires a nightly Rust toolchain, and switching model configurations means recompiling with different feature flags. GPU acceleration via Triton is still on the TODO list, so operation is currently CPU-only; safetensors support is likewise pending.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 11 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

airllm by lyogavin

0.7%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 months ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.1%
8k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 3 weeks ago