Accelerate LLM agents on consumer hardware
OptiML is an acceleration library designed to enable high-speed Large Language Model (LLM) inference on consumer-grade hardware by intelligently distributing computation between the CPU and GPU. It targets users who want to run large models locally without requiring datacenter-class GPUs, offering significant speedups and reduced VRAM requirements.
How It Works
OptiML leverages the principle of "activation locality": a small subset of "hot" neurons is activated frequently across inputs, while the majority of "cold" neurons activate only for particular inputs. It pins the hot neurons' weights to the GPU for fast reuse and offloads computation of the cold neurons to the CPU. This hybrid approach, combined with quantization, balances latency, throughput, and memory usage, allowing larger models to run efficiently on commodity PCs.
Quick Start & Requirements
Installation is available via pip.
Highlighted Details
Maintenance & Community
The project was initiated at Northwestern University's QRG lab. Links to X (Twitter) and a GitHub star count indicate community interest. A roadmap outlines plans for broader model support, new quantization modes, and extended demos.
Licensing & Compatibility
The project is licensed under the MIT license, permitting commercial use and integration with closed-source applications.
Limitations & Caveats
The Python API is in an early stage and may contain bugs. Model support currently covers Llama 2 and Llama 3, with plans to expand coverage.