rllama by Noeda

Rust CLI tool for LLaMA model inference

created 2 years ago · 549 stars · Top 59.0% on sourcepulse

Project Summary

RLLaMA is a pure-Rust implementation of LLaMA model inference, aimed at developers and researchers who need efficient LLM execution on diverse hardware. It accelerates CPU inference with hand-optimized AVX2 instructions and supports OpenCL for GPU acceleration, enabling hybrid CPU-GPU inference.

How It Works

This project leverages Rust's performance and safety features for LLM inference. It utilizes AVX2 intrinsics for optimized CPU computations and provides OpenCL integration for GPU acceleration. A key feature is the --percentage-to-gpu flag, allowing users to load only a portion of the model onto the GPU, facilitating inference on hardware with limited VRAM.
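
A minimal invocation sketch follows. Only --percentage-to-gpu is confirmed by this summary; the path and prompt flags, and the assumption that the value is a 0-1 fraction rather than 0-100, are guesses about rllama's CLI, so check rllama --help before copying:

    # Hypothetical flag spellings and paths; only --percentage-to-gpu
    # comes from the summary above. 0.5 assumes a 0-1 fraction.
    rllama \
      --tokenizer-path ./tokenizer.model \
      --model-path ./LLaMA/7B \
      --param-path ./LLaMA/7B/params.json \
      --percentage-to-gpu 0.5 \
      --prompt "The capital of France is"

Splitting the model this way keeps part of the weights in VRAM for the OpenCL kernels while the rest stays on the CPU's AVX2 path, which is what makes larger models usable on cards that cannot hold them whole.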

Quick Start & Requirements

  • Install via cargo install rllama with RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" (full command shown after this list).
  • Requires LLaMA model weights and tokenizer.
  • Docker images are available for CPU and NVIDIA GPU setups.
  • See official documentation for detailed setup and model preparation.
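
The install command from the first bullet, spelled out. The RUSTFLAGS string is verbatim from the summary; the opencl cargo feature is an assumption to verify against the project's README:

    # Enable the CPU features that rllama's hand-written AVX2 kernels require.
    RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama

    # GPU builds may need to opt into an OpenCL cargo feature (assumed name):
    # RUSTFLAGS="-C target-feature=+sse2,+avx,+fma,+avx2" cargo install rllama --features opencl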

Highlighted Details

  • Supports LLaMA-7B, 13B, 30B, and 65B models with f16 and f32 weights.
  • Offers a simple HTTP API for inference serving (see the hedged sketch after this list).
  • Includes an experimental interactive mode for chat-like interactions.
  • Benchmarks show competitive performance, e.g., LLaMA-7B at 216ms/token on RTX 3090 Ti via OpenCL.
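
For the HTTP API, a hedged sketch of what serving could look like. The server flag, port, endpoint path, and JSON payload are all assumptions rather than details confirmed by this summary, so treat them strictly as placeholders:

    # Assumed flag name for server mode; see the project README for the real one.
    rllama --model-path ./LLaMA/7B --inference-server

    # Placeholder endpoint and payload.
    curl -X POST --data '{"prompt": "Hello"}' http://localhost:8080/...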

Maintenance & Community

The project is described as a hobby, with no explicit mention of active maintenance or community channels.

Licensing & Compatibility

The project does not explicitly state a license in the provided README.

Limitations & Caveats

The author describes this as a hobby project, which implies limited support and infrequent updates. Performance may be surpassed by libraries that exploit hardware features such as NVIDIA Tensor Cores, which OpenCL does not expose. The interactive mode's output formatting is not yet polished.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (former CEO of GitHub), and 32 more.

llama.cpp by ggml-org: C/C++ library for local LLM inference
Top 0.4% on sourcepulse · 84k stars · created 2 years ago · updated 14 hours ago