Large-context LLM inference on consumer hardware
Top 22.9% on SourcePulse
Summary
oLLM is a Python library designed for efficient, large-context Large Language Model (LLM) inference on consumer-grade hardware. It targets researchers and power users needing to process extensive documents or conversations locally, enabling models like Llama-3.1-8B-Instruct and qwen3-next-80B to operate with contexts up to 100k tokens on GPUs with as little as 8GB VRAM, without resorting to quantization.
How It Works
The library employs several techniques to manage memory constraints. It streams model layer weights from SSD to the GPU one layer at a time and offloads the KV cache to SSD, minimizing VRAM usage. Further optimization comes from FlashAttention-2 with online softmax, which avoids materializing the full attention matrix. Additionally, chunked MLP layers reduce the memory overhead of intermediate computations, as sketched below. This approach prioritizes fp16/bf16 precision over quantization to preserve model accuracy.
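To illustrate the chunked-MLP idea, the minimal PyTorch sketch below processes the sequence in slices so that the large intermediate activation is never materialized for all tokens at once. This is a conceptual illustration only; the class, parameter names, and chunk size are assumptions and do not reflect oLLM's actual implementation.

import torch
import torch.nn as nn

class ChunkedMLP(nn.Module):
    """Conceptual sketch of a chunked MLP forward pass (not oLLM's code)."""
    def __init__(self, hidden_dim, intermediate_dim, chunk_size=2048):
        super().__init__()
        self.up = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down = nn.Linear(intermediate_dim, hidden_dim, bias=False)
        self.act = nn.GELU()
        self.chunk_size = chunk_size

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim); split along the sequence dimension
        # so only chunk_size tokens expand to intermediate_dim at any moment.
        outputs = [self.down(self.act(self.up(chunk)))
                   for chunk in x.split(self.chunk_size, dim=1)]
        return torch.cat(outputs, dim=1)

The same principle applies per token chunk regardless of batch size, trading a small amount of extra kernel-launch overhead for a much smaller peak activation footprint.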
Quick Start & Requirements
Installation is straightforward via pip: pip install ollm. For specific models such as qwen3-next, a development version of Hugging Face Transformers may be required (pip install git+https://github.com/huggingface/transformers.git). A key dependency is kvikio-cu{cuda_version} (e.g., kvikio-cu12), indicating a requirement for a compatible Nvidia GPU and CUDA toolkit (Ampere, Ada Lovelace, Hopper, or newer). An SSD is essential for the offloading strategies to function effectively. Setup involves creating a Python virtual environment and installing the package. The official GitHub repository is https://github.com/Mega4alik/ollm.git.
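Under those requirements, a typical setup might look like the following (assuming a Linux shell and CUDA 12; the environment name is arbitrary, and the kvikio suffix should match the installed CUDA version):

python3 -m venv ollm_env
source ollm_env/bin/activate
pip install ollm kvikio-cu12
pip install git+https://github.com/huggingface/transformers.git   # only needed for models like qwen3-next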
Highlighted Details
Maintenance & Community
The project is maintained by Mega4alik. For requests regarding model support, users can contact anuarsh@ailabs.us. No specific community channels such as Discord or Slack are mentioned in the README.
Licensing & Compatibility
The license type is not explicitly stated in the provided README. Compatibility is primarily focused on Nvidia GPUs with specific CUDA versions.
Limitations & Caveats
The project is strictly limited to Nvidia GPUs, and its performance depends heavily on the speed of the underlying SSD used for offloading. The absence of a specified license is a significant adoption blocker, especially for commercial use. A development version of the transformers library is required for certain models.