oLLM by Mega4alik

Large-context LLM inference on consumer hardware

Created 1 month ago
1,886 stars

Top 22.9% on SourcePulse

Project Summary

Summary

oLLM is a Python library designed for efficient, large-context Large Language Model (LLM) inference on consumer-grade hardware. It targets researchers and power users needing to process extensive documents or conversations locally, enabling models like Llama-3.1-8B-Instruct and qwen3-next-80B to operate with contexts up to 100k tokens on GPUs with as little as 8GB VRAM, without resorting to quantization.

How It Works

The library combines several techniques to fit within tight VRAM budgets. It loads model layer weights directly from SSD to the GPU one layer at a time and offloads the KV cache to SSD, keeping resident VRAM usage small. FlashAttention-2 with online softmax avoids materializing the full attention matrix, and chunked MLP layers cut the memory overhead of intermediate activations. Throughout, the library keeps fp16/bf16 precision rather than quantizing, preserving model accuracy.
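
To make the layer-streaming idea concrete, below is a minimal PyTorch sketch of the technique, not oLLM's actual code: oLLM moves weights from SSD to the GPU directly (the kvikio dependency suggests GPUDirect-style I/O), whereas this sketch uses plain torch.load for simplicity, and the per-layer file layout, the build_layer factory, and the function name are assumptions made for the example.

    # Illustrative layer-by-layer weight streaming (not oLLM's implementation).
    # Each decoder layer's weights live in their own file on SSD and are loaded
    # onto the GPU only for that layer's forward pass, so VRAM holds roughly one
    # layer at a time instead of the whole model.
    import torch

    def run_layers_streamed(hidden, layer_files, build_layer, device="cuda"):
        """hidden: activations (batch, seq, d_model); layer_files: per-layer
        state_dict paths on SSD; build_layer: factory returning an empty layer."""
        for path in layer_files:
            layer = build_layer()                          # architecture only, no weights yet
            layer.load_state_dict(torch.load(path, map_location="cpu"))
            layer = layer.to(device=device, dtype=torch.float16)
            with torch.no_grad():
                hidden = layer(hidden)
            del layer                                      # release this layer's VRAM
            torch.cuda.empty_cache()                       # before streaming in the next one
        return hidden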

Quick Start & Requirements

Installation is straightforward via pip: pip install ollm. For certain models such as qwen3-next, a development build of Hugging Face Transformers may be required (pip install git+https://github.com/huggingface/transformers.git). A key dependency is kvikio-cu{cuda_version} (e.g., kvikio-cu12), which implies a compatible Nvidia GPU and CUDA toolkit (Ampere, Ada Lovelace, Hopper, or newer). An SSD is essential for the offloading strategies to work effectively. Setup involves creating a Python virtual environment and installing the package. The official GitHub repository is https://github.com/Mega4alik/ollm.git.
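
As a quick pre-flight check before downloading models, the hedged snippet below (not taken from the oLLM README) verifies that a CUDA-capable GPU is visible and that the packages named in the install commands above can be imported.

    # Pre-flight environment check (a convenience sketch, not part of oLLM).
    import importlib.util
    import torch

    assert torch.cuda.is_available(), "oLLM needs an Nvidia GPU (Ampere, Ada Lovelace, Hopper, or newer)"
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA runtime:", torch.version.cuda)

    for pkg in ("ollm", "kvikio", "transformers"):
        found = importlib.util.find_spec(pkg) is not None
        print(f"{pkg}: {'installed' if found else 'missing -- see the pip commands above'}")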

Highlighted Details

  • Supports large context windows, including 100k tokens for Llama-3 models.
  • Enables inference for large models (e.g., qwen3-next-80B, gpt-oss-20B) on 8GB VRAM consumer GPUs.
  • Utilizes SSD offloading for weights and KV cache, and FlashAttention-2 for memory efficiency.
  • Recent updates (v0.4.0) add an optimized qwen3-next-80B path with a reported throughput of roughly 1 token per 2 seconds, replace the Llama3 attention implementation with FlashAttention-2, and introduce DiskCache for on-disk KV caching (see the sketch after this list).
  • Detailed VRAM usage tables are provided, showing significant reductions compared to baseline inference.
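
The DiskCache item above refers to keeping the KV cache on SSD rather than in VRAM. The sketch below is a minimal illustration of that idea under an assumed file layout and class name; it is not oLLM's actual DiskCache, which presumably goes through kvikio for faster SSD I/O.

    # Minimal disk-backed KV cache sketch (illustrative only, not oLLM's DiskCache).
    # Per-layer key/value tensors are written to SSD after a layer runs and read
    # back on the next decoding step, so they never accumulate in VRAM.
    import os
    import torch

    class DiskKVCache:
        def __init__(self, cache_dir: str):
            self.cache_dir = cache_dir
            os.makedirs(cache_dir, exist_ok=True)

        def _path(self, layer_idx: int) -> str:
            return os.path.join(self.cache_dir, f"layer_{layer_idx:03d}.pt")

        def save(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor) -> None:
            # Move to CPU and write to SSD, freeing the VRAM the tensors occupied.
            torch.save({"k": key.cpu(), "v": value.cpu()}, self._path(layer_idx))

        def load(self, layer_idx: int, device: str = "cuda"):
            blob = torch.load(self._path(layer_idx), map_location=device)
            return blob["k"], blob["v"]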

Maintenance & Community

The project is maintained by Mega4alik. For requests regarding model support, users can contact anuarsh@ailabs.us. No specific community channels like Discord or Slack are mentioned in the README.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. Compatibility is primarily focused on Nvidia GPUs with specific CUDA versions.

Limitations & Caveats

The project is strictly limited to Nvidia GPUs. Its performance heavily relies on the speed of the underlying SSD for offloading operations. The absence of a specified license poses a significant adoption blocker, especially for commercial use. A specific development version of the transformers library is noted as a requirement for certain models.

Health Check
  • Last Commit: 18 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 6
  • Issues (30d): 16
  • Star History: 1,821 stars in the last 30 days

Explore Similar Projects

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems").

  • airllm by lyogavin (top 0.1%, 6k stars): Inference optimization for LLMs on low-resource hardware. Created 2 years ago, updated 1 month ago.