llama2.rs by srush

Rust library for fast Llama 2 inference on CPU

created 2 years ago
1,053 stars

Top 36.4% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

This project provides a high-performance Llama 2 inference engine implemented entirely in Rust, targeting CPU execution. It aims to offer speed comparable to or exceeding GPU-based solutions for researchers and power users needing efficient local LLM deployment.

How It Works

The engine leverages several key optimizations for CPU inference: 4-bit GPT-Q quantization to cut memory and compute, batched prefill for prompt processing, and SIMD instructions for parallel arithmetic. It also implements Grouped Query Attention (GQA), which the larger models require, and memory-maps checkpoints so that even massive models like Llama 2 70B load instantly.
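To make the combination concrete, here is a minimal sketch of the kind of dequantize-then-dot kernel this design implies, using the nightly portable_simd feature. The function name, group size, and packing layout are illustrative assumptions in the spirit of a group_64 quantized build, not the project's actual kernel:

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

/// Hypothetical sketch: dot product of one 64-weight quantization group
/// against f32 activations. GPT-Q-style storage packs 4-bit weights two
/// per byte, with a per-group scale and zero point.
fn group_dot(packed: &[u8; 32], scale: f32, zero: f32, acts: &[f32; 64]) -> f32 {
    // Unpack and dequantize: w = scale * (q - zero).
    let mut w = [0.0f32; 64];
    for (i, byte) in packed.iter().enumerate() {
        w[2 * i] = scale * ((byte & 0x0f) as f32 - zero);
        w[2 * i + 1] = scale * ((byte >> 4) as f32 - zero);
    }
    // Multiply-accumulate in SIMD lanes of 8 via portable_simd.
    let mut acc = f32x8::splat(0.0);
    for (wc, ac) in w.chunks_exact(8).zip(acts.chunks_exact(8)) {
        acc += f32x8::from_slice(wc) * f32x8::from_slice(ac);
    }
    acc.reduce_sum()
}
```

Keeping weights quantized until the inner loop is what makes a 70B model fit in CPU memory; only one group at a time is ever expanded to f32.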

Quick Start & Requirements

  • Install: Requires nightly Rust toolchain (rustup toolchain install nightly).
  • Build: Compile with specific features matching the model (e.g., cargo run --release --features 70B,group_64,quantized -- ...).
  • Dependencies: memmap2, rayon, clap, pyo3 (for the Python API), plus the nightly portable_simd feature.
  • Model Loading: Models can be loaded from Hugging Face Hub using an export.py script (requires pip install -r requirements.export.txt).
  • Python API: Compile with --features python and install via pip install . (a sketch of such a binding follows this list).
  • Resource: Raise the stack size limit with ulimit -s 10000000.
  • Docs: [No explicit link provided in README]
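The README doesn't show the binding code itself, but a pyo3 module exposing generation would take roughly this shape. The module name llama2_rs, the function name generate, and its signature are illustrative assumptions, not the crate's actual API (pre-0.21 pyo3 module signature shown):

```rust
use pyo3::prelude::*;

/// Hypothetical binding: a real implementation would run the
/// transformer here instead of returning a placeholder.
#[pyfunction]
fn generate(prompt: String, steps: usize) -> PyResult<String> {
    Ok(format!("generated {} tokens for '{}'", steps, prompt))
}

/// Module initializer; `pip install .` (via the python feature) builds
/// and registers this as an importable extension module.
#[pymodule]
fn llama2_rs(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(generate, m)?)?;
    Ok(())
}
```

After installation, Python-side usage would look like `import llama2_rs; llama2_rs.generate("Hello", 50)`.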

Highlighted Details

  • Achieves ~0.89 tok/s for 70B Llama 2 and ~9 tok/s for 7B Llama 2 on an Intel i9.
  • Supports 4-bit GPT-Q quantization and Grouped Query Attention.
  • Memory mapping allows near-instant loading of large models (see the sketch after this list).
  • Offers a Python calling API via pyo3.
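On the memory-mapping point, a minimal sketch using the memmap2 dependency looks like this; map_weights is a hypothetical helper, not the project's API. Mapping only reserves address space, and pages are faulted in from disk on first access, which is why even a 70B checkpoint "loads" instantly:

```rust
use memmap2::Mmap;
use std::fs::File;

/// Map a checkpoint file into memory without reading it eagerly.
fn map_weights(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the underlying file must not be truncated or modified
    // while the mapping is alive.
    unsafe { Mmap::map(&file) }
}
```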

Maintenance & Community

  • Developed by @srush and @rachtsingh.
  • Mentions potential integration with text-generation-webui.
  • TODO list includes GPU support (Triton), documentation, and safetensors support.
  • [No community links like Discord/Slack provided in README]

Licensing & Compatibility

  • License not explicitly stated in the README.

Limitations & Caveats

The project requires a nightly Rust toolchain and must be recompiled for each model configuration. GPU acceleration via Triton is listed as a future TODO, so operation is currently CPU-only, and safetensors support is likewise still pending.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems).

  • JittorLLMs by Jittor: low-resource LLM inference library (~2k stars; created 2 years ago, updated 5 months ago)