rllama by Noeda

Rust CLI tool for LLaMA model inference

created 2 years ago · 549 stars · Top 59.0% on sourcepulse

Project Summary

RLLaMA is a pure-Rust implementation of LLaMA model inference, aimed at developers and researchers who need efficient LLM execution on diverse hardware. It accelerates CPU inference with hand-optimized AVX2 instructions and supports OpenCL for GPU acceleration, enabling hybrid CPU-GPU inference.

How It Works

This project leverages Rust's performance and safety features for LLM inference. It utilizes AVX2 intrinsics for optimized CPU computations and provides OpenCL integration for GPU acceleration. A key feature is the --percentage-to-gpu flag, allowing users to load only a portion of the model onto the GPU, facilitating inference on hardware with limited VRAM.
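
A minimal invocation sketch follows. Only --percentage-to-gpu is confirmed by this summary; the path and prompt flags, and the assumption that the value is a 0-1 fraction rather than 0-100, are guesses about rllama's CLI, so check rllama --help before copying:

    # Hypothetical flag spellings and paths; only --percentage-to-gpu
    # comes from the summary above. 0.5 assumes a 0-1 fraction.
    rllama \
      --tokenizer-path ./tokenizer.model \
      --model-path ./LLaMA/7B \
      --param-path ./LLaMA/7B/params.json \
      --percentage-to-gpu 0.5 \
      --prompt "The capital of France is"

Splitting the model this way keeps part of the weights in VRAM for the OpenCL kernels while the rest stays on the CPU's AVX2 path, which is what makes larger models usable on cards that cannot hold them whole.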

Quick Start & Requirements

  • Install via cargo install rllama with RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" (full command shown after this list).
  • Requires LLaMA model weights and tokenizer.
  • Docker images are available for CPU and NVIDIA GPU setups.
  • See official documentation for detailed setup and model preparation.
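
The install command from the first bullet, spelled out. The RUSTFLAGS string is verbatim from the summary; the opencl cargo feature is an assumption to verify against the project's README:

    # Enable the CPU features that rllama's hand-written AVX2 kernels require.
    RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama

    # GPU builds may need to opt into an OpenCL cargo feature (assumed name):
    # RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl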

Highlighted Details

  • Supports LLaMA-7B, 13B, 30B, and 65B models with f16 and f32 weights.
  • Offers a simple HTTP API for inference serving (see the hedged sketch after this list).
  • Includes an experimental interactive mode for chat-like interactions.
  • Benchmarks show competitive performance, e.g., LLaMA-7B at 216ms/token on RTX 3090 Ti via OpenCL.
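
For the HTTP API, a hedged sketch of what serving could look like. The server flag, port, endpoint path, and JSON payload are all assumptions rather than details confirmed by this summary, so treat them strictly as placeholders:

    # Assumed flag name for server mode; see the project README for the real one.
    rllama --model-path ./LLaMA/7B --inference-server

    # Placeholder endpoint and payload.
    curl -X POST --data '{"prompt": "Hello"}' http://localhost:8080/...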

Maintenance & Community

The project is described as a hobby, with no explicit mention of active maintenance or community channels.

Licensing & Compatibility

The project does not explicitly state a license in the provided README.

Limitations & Caveats

The author describes this as a hobby project, which implies limited support and infrequent updates. Performance may be surpassed by libraries that exploit hardware features such as NVIDIA Tensor Cores, which OpenCL does not expose. The interactive mode's output formatting is not yet polished.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org: C/C++ library for local LLM inference
Top 0.4% on sourcepulse · 84k stars · created 2 years ago · updated 14 hours ago