optiml by NU-QRG

Accelerate LLM inference on consumer hardware

Created 1 month ago
509 stars

Top 61.4% on SourcePulse

View on GitHub
Project Summary

OptiML is an acceleration library designed to enable high-speed Large Language Model (LLM) inference on consumer-grade hardware by intelligently distributing computation between the CPU and GPU. It targets users who want to run large models locally without requiring datacenter-class GPUs, offering significant speedups and reduced VRAM requirements.

How It Works

OptiML leverages the principle of "activation locality": a small subset of "hot" neurons is activated frequently across inputs, while the majority of "cold" neurons fire only for specific inputs. It pins the hot neurons and their weights to the GPU for fast reuse and offloads computation of the cold neurons to the CPU. This hybrid approach, combined with quantization, balances latency, throughput, and memory usage, allowing larger models to run efficiently on commodity PCs.
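
To make the idea concrete, here is a minimal sketch (not OptiML's actual code) of how a hot/cold split might be derived: profile how often each feed-forward neuron fires on a small calibration set, pin the most frequently firing neurons to the GPU, and leave the rest on the CPU. The profiling step, the fixed GPU budget, and all function names below are assumptions for illustration.

```python
import numpy as np

def profile_activation_frequency(activations: np.ndarray) -> np.ndarray:
    """activations: (num_samples, num_neurons) post-ReLU FFN activations
    collected on a small calibration set. Returns the fraction of samples
    on which each neuron fired (was non-zero)."""
    return (activations > 0).mean(axis=0)

def split_hot_cold(freq: np.ndarray, gpu_budget: int):
    """Pick the `gpu_budget` most frequently firing neurons as 'hot'
    (pinned to the GPU); everything else is 'cold' (computed on the CPU)."""
    order = np.argsort(freq)[::-1]        # most frequently active first
    hot = np.sort(order[:gpu_budget])     # indices pinned to GPU memory
    cold = np.sort(order[gpu_budget:])    # indices evaluated on the CPU
    return hot, cold

# Toy example: 8 calibration samples, 16 neurons, room for 4 neurons on the GPU.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(8, 16)), 0.0)   # fake post-ReLU activations
hot, cold = split_hot_cold(profile_activation_frequency(acts), gpu_budget=4)
print("hot (GPU):", hot, "cold (CPU):", cold)
```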

Quick Start & Requirements

  • Installation: Build from source using CMake. Python bindings can be installed via pip.
  • Prerequisites: Consumer GPU (NVIDIA/AMD/Apple Silicon) with recent drivers, modern CPU with AVX2 (or Apple Silicon), CMake ≥ 3.20, C/C++ toolchain, Python 3.9+.
  • Model Preparation: Requires models in GGUF format; includes a tool for quantization (e.g., to Q4_K). A simplified quantization sketch follows this list.
  • Resources: Building from source and preparing models may take time depending on hardware.
  • Links: GitHub Repository, Docker Hub
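
The Q4_K quantization mentioned under Model Preparation stores weights at roughly 4 bits each. Below is a minimal sketch of the general idea behind group-wise 4-bit quantization (a shared scale and minimum per group of weights); it is a simplified illustration, not the exact GGUF Q4_K layout.

```python
import numpy as np

def quantize_q4_groups(weights: np.ndarray, group_size: int = 32):
    """Simplified group-wise 4-bit quantization: each group of `group_size`
    weights shares one float scale and minimum; values are stored as integer
    codes in [0, 15]. (The real Q4_K format adds super-block scales.)"""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                 # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)       # avoid division by zero
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_q4_groups(q, scale, w_min, original_shape):
    """Reconstruct approximate float weights from codes, scales, and minimums."""
    return (q.astype(np.float32) * scale + w_min).reshape(original_shape)

# Round-trip a small random weight matrix and report the quantization error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
q, s, m = quantize_q4_groups(w)
w_hat = dequantize_q4_groups(q, s, m, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```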

Highlighted Details

  • Achieves up to 2.7x speedup compared to llama.cpp on consumer hardware.
  • Supports hybrid CPU/GPU execution for reduced VRAM pressure.
  • Offers a CLI, Python API, and HTTP demo server for easy deployment (a sample HTTP request follows this list).
  • Works with decoder-only transformer families, particularly LLaMA variants in GGUF format.
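
As a rough illustration of driving the HTTP demo server from a client, the snippet below sends a JSON completion request using Python's standard library. The port, endpoint path, and payload fields are assumptions for illustration only; consult the repository for the server's actual interface.

```python
import json
import urllib.request

# Hypothetical request to a locally running OptiML demo server.
# The port, path, and payload fields are assumptions, not a documented API.
payload = {"prompt": "Explain activation locality in one sentence.",
           "n_predict": 64}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))
```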

Maintenance & Community

The project originated at Northwestern University's QRG lab. Links to X (Twitter) and the GitHub star count are provided as indicators of community interest. A roadmap outlines plans for broader model support, new quantization modes, and extended demos.

Licensing & Compatibility

The project is licensed under the MIT license, permitting commercial use and integration with closed-source applications.

Limitations & Caveats

The Python API is still at an early stage and may contain bugs. Model support currently covers primarily Llama 2 and Llama 3, with plans to expand coverage.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Ying Sheng (coauthor of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 1 week ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Clement Delangue (cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 15 hours ago