optiml by NU-QRG

Accelerate LLM inference on consumer hardware

Created 1 month ago
509 stars

Top 61.4% on SourcePulse

View on GitHub
Project Summary

OptiML is an acceleration library designed to enable high-speed Large Language Model (LLM) inference on consumer-grade hardware by intelligently distributing computation between the CPU and GPU. It targets users who want to run large models locally without requiring datacenter-class GPUs, offering significant speedups and reduced VRAM requirements.

How It Works

OptiML leverages the principle of "activation locality": a small subset of "hot" neurons is activated frequently across inputs, while the majority of "cold" neurons fire only for specific inputs. It pins the hot neurons and their weights to the GPU for fast reuse and offloads computation of the cold neurons to the CPU. This hybrid approach, combined with quantization, balances latency, throughput, and memory usage, allowing larger models to run efficiently on commodity PCs.
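
To make the idea concrete, here is a minimal sketch (not OptiML's actual code) of how a hot/cold split might be derived: profile how often each feed-forward neuron fires on a small calibration set, pin the most frequently firing neurons to the GPU, and leave the rest on the CPU. The profiling step, the fixed GPU budget, and all function names below are assumptions for illustration.

```python
import numpy as np

def profile_activation_frequency(activations: np.ndarray) -> np.ndarray:
    """activations: (num_samples, num_neurons) post-ReLU FFN activations
    collected on a small calibration set. Returns the fraction of samples
    on which each neuron fired (was non-zero)."""
    return (activations > 0).mean(axis=0)

def split_hot_cold(freq: np.ndarray, gpu_budget: int):
    """Pick the `gpu_budget` most frequently firing neurons as 'hot'
    (pinned to the GPU); everything else is 'cold' (computed on the CPU)."""
    order = np.argsort(freq)[::-1]        # most frequently active first
    hot = np.sort(order[:gpu_budget])     # indices pinned to GPU memory
    cold = np.sort(order[gpu_budget:])    # indices evaluated on the CPU
    return hot, cold

# Toy example: 8 calibration samples, 16 neurons, room for 4 neurons on the GPU.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(8, 16)), 0.0)   # fake post-ReLU activations
hot, cold = split_hot_cold(profile_activation_frequency(acts), gpu_budget=4)
print("hot (GPU):", hot, "cold (CPU):", cold)
```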

Quick Start & Requirements

  • Installation: Build from source using CMake. Python bindings can be installed via pip.
  • Prerequisites: Consumer GPU (NVIDIA/AMD/Apple Silicon) with recent drivers, modern CPU with AVX2 (or Apple Silicon), CMake ≥ 3.20, C/C++ toolchain, Python 3.9+.
  • Model Preparation: Requires models in GGUF format; includes a tool for quantization (e.g., to Q4_K). A simplified quantization sketch follows this list.
  • Resources: Building from source and preparing models may take time depending on hardware.
  • Links: GitHub Repository, Docker Hub
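
The Q4_K quantization mentioned under Model Preparation stores weights at roughly 4 bits each. Below is a minimal sketch of the general idea behind group-wise 4-bit quantization (a shared scale and minimum per group of weights); it is a simplified illustration, not the exact GGUF Q4_K layout.

```python
import numpy as np

def quantize_q4_groups(weights: np.ndarray, group_size: int = 32):
    """Simplified group-wise 4-bit quantization: each group of `group_size`
    weights shares one float scale and minimum; values are stored as integer
    codes in [0, 15]. (The real Q4_K format adds super-block scales.)"""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)       # avoid division by zero
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_q4_groups(q, scale, w_min, original_shape):
    """Reconstruct approximate float weights from codes, scales, and minimums."""
    return (q.astype(np.float32) * scale + w_min).reshape(original_shape)

# Round-trip a small random weight matrix and report the quantization error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
q, s, m = quantize_q4_groups(w)
w_hat = dequantize_q4_groups(q, s, m, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```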

Highlighted Details

  • Achieves up to 2.7x speedup compared to llama.cpp on consumer hardware.
  • Supports hybrid CPU/GPU execution for reduced VRAM pressure.
  • Offers a CLI, Python API, and HTTP demo server for easy deployment (a sample HTTP request follows this list).
  • Works with decoder-only transformer families, particularly LLaMA variants in GGUF format.
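
As a rough illustration of driving the HTTP demo server from a client, the snippet below sends a JSON completion request using Python's standard library. The port, endpoint path, and payload fields are assumptions for illustration only; consult the repository for the server's actual interface.

```python
import json
import urllib.request

# Hypothetical request to a locally running OptiML demo server.
# The port, path, and payload fields are assumptions, not a documented API.
payload = {"prompt": "Explain activation locality in one sentence.",
           "n_predict": 64}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))
```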

Maintenance & Community

The project originated at Northwestern University's QRG lab. Links to X (Twitter) and the GitHub star count are provided as indicators of community interest. A roadmap outlines plans for broader model support, new quantization modes, and extended demos.

Licensing & Compatibility

The project is licensed under the MIT license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The Python API is still at an early stage and may contain bugs. Model support currently covers primarily Llama 2 and Llama 3, with plans to expand coverage.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 15 hours ago