llama2.rs by srush

Rust library for fast Llama2 inference on CPU

Created 2 years ago
1,052 stars

Top 35.8% on SourcePulse

Project Summary

This project provides a high-performance Llama 2 inference engine implemented entirely in Rust and targeting CPU execution. It aims to offer speed comparable to, or exceeding, GPU-based solutions, and is aimed at researchers and power users who need efficient local LLM deployment.

How It Works

The engine leverages several key optimizations for CPU inference: 4-bit GPT-Q quantization to cut memory use and compute, batched prefill for prompt processing, and SIMD instructions for data-parallel arithmetic. It also supports Grouped Query Attention (GQA), used by the larger models, and memory-maps checkpoints so that even massive models like the 70B Llama 2 load instantly.
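A toy sketch of how the quantization and SIMD pieces fit together, assuming a recent nightly toolchain for std::simd; the layout and function here are illustrative, not the project's actual kernel:

```rust
// Toy example: combines 4-bit dequantization with std::simd,
// which is why nightly Rust is required.
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Dot product against one group of 4-bit quantized weights.
/// `packed` holds eight 4-bit weights per u32; `scale` and `zero`
/// are the group's GPT-Q-style quantization parameters.
fn quantized_dot(packed: &[u32], scale: f32, zero: f32, x: &[f32]) -> f32 {
    assert!(x.len() >= packed.len() * 8);
    let mut acc = f32x8::splat(0.0);
    for (i, &word) in packed.iter().enumerate() {
        // Unpack eight 4-bit integers from one u32 into SIMD lanes.
        let w = f32x8::from_array(std::array::from_fn(|j| {
            ((word >> (4 * j)) & 0xF) as f32
        }));
        let xs = f32x8::from_slice(&x[i * 8..i * 8 + 8]);
        // Dequantize as (w - zero) * scale and accumulate 8 lanes at once.
        acc += (w - f32x8::splat(zero)) * f32x8::splat(scale) * xs;
    }
    acc.reduce_sum()
}

fn main() {
    // 32 weights all equal to 7; with zero = 7, every dequantized weight is 0.
    let packed = vec![0x7777_7777u32; 4];
    let x = vec![1.0f32; 32];
    assert_eq!(quantized_dot(&packed, 0.5, 7.0, &x), 0.0);
}
```

Real GPT-Q kernels keep the weights packed in memory and dequantize only inside the inner loop; since CPU inference is dominated by memory bandwidth, moving 4-bit rather than 16-bit weights is what makes this both smaller and faster.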

Quick Start & Requirements

  • Install: Requires nightly Rust toolchain (rustup toolchain install nightly).
  • Build: Compile with specific features matching the model (e.g., cargo run --release --features 70B,group_64,quantized -- ...).
  • Dependencies: memmap2, rayon, clap, and pyo3 (for the Python API); SIMD comes from nightly Rust's portable_simd (std::simd), hence the nightly requirement.
  • Model Loading: Models can be loaded from Hugging Face Hub using an export.py script (requires pip install -r requirements.export.txt).
  • Python API: Compile with --features python and install via pip install . (a binding sketch follows this list).
  • Resource: run ulimit -s 10000000 to raise the stack size limit before launching.
  • Docs: [No explicit link provided in README]
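As a sketch of how that Python API is wired up: pyo3 exposes Rust functions as an importable extension module. The names here (llama2_rs, generate) are hypothetical, and the signatures assume a pyo3 0.19-era API rather than the project's actual interface:

```rust
use pyo3::prelude::*;

/// Hypothetical generation entry point; the real library's
/// Python-facing names and arguments may differ.
#[pyfunction]
fn generate(prompt: &str, steps: usize) -> PyResult<String> {
    // The real implementation would run the quantized transformer;
    // echoing the prompt keeps this sketch self-contained.
    Ok(format!("{prompt} ... ({steps} steps)"))
}

/// After `pip install .`, Python code can `import llama2_rs` (name assumed).
#[pymodule]
fn llama2_rs(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(generate, m)?)?;
    Ok(())
}
```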

Highlighted Details

  • Achieves ~0.89 tok/s for 70B Llama 2 and ~9 tok/s for 7B Llama 2 on an Intel i9.
  • Supports 4-bit GPT-Q quantization and Grouped Query Attention.
  • Memory mapping allows instant loading of large models (see the sketch after this list).
  • Offers a Python calling API via pyo3.
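A minimal sketch of the memory-mapping point above, using the memmap2 crate already listed under dependencies (the file name is hypothetical): mapping returns immediately and the OS pages weights in lazily on first access, which is why even a 70B checkpoint appears to load instantly.

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Hypothetical path; the project exports checkpoints in its own format.
    let file = File::open("llama2-70b.bin")?;
    // Safety: the file must not be truncated or modified while mapped.
    let weights: Mmap = unsafe { Mmap::map(&file)? };
    // Returns immediately; no bytes have been read from disk yet.
    println!("mapped {} bytes", weights.len());
    // Touching data faults in only the pages actually accessed.
    let (_, floats, _) = unsafe { weights[..4096].align_to::<f32>() };
    println!("first weight: {:?}", floats.first());
    Ok(())
}
```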

Maintenance & Community

  • Developed by @srush and @rachtsingh.
  • Mentions potential integration with text-generation-webui.
  • TODO list includes GPU support (Triton), documentation, and safetensors support.
  • [No community links like Discord/Slack provided in README]

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

The project requires a nightly Rust toolchain, and switching model configurations means recompiling with different feature flags. GPU acceleration via Triton is still on the TODO list, so operation is currently CPU-only; safetensors support is likewise pending.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.3%
6k
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 11 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

airllm by lyogavin

0.7%
6k
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 months ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.1%
8k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 3 weeks ago