Inference optimization for LLMs on low-resource hardware
AirLLM enables running large language models, including 70B-parameter models, on consumer hardware with as little as 4 GB of VRAM, without requiring quantization or distillation. It targets researchers and power users who need to run LLM inference in resource-constrained environments, offering significant memory savings during inference.
How It Works
AirLLM achieves its low memory footprint by decomposing models into layers and loading them dynamically as needed. This "layer-wise" splitting, combined with optional block-wise quantization of weights (4-bit or 8-bit), reduces the peak memory requirement. Prefetching is also employed to overlap model loading with computation, improving inference speed.
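The sketch below illustrates the layer-wise idea in isolation. It is not AirLLM's actual implementation; the per-layer checkpoint files and the `build_layer` helper are hypothetical stand-ins for the one-time model split the library performs.

```python
import torch

def layerwise_forward(hidden_states, layer_checkpoints, build_layer):
    # Run a forward pass while keeping only one transformer layer in VRAM at a time.
    # layer_checkpoints: per-layer weight files produced by a one-time split (hypothetical).
    # build_layer: callable that constructs an empty layer module on the CPU (hypothetical).
    for path in layer_checkpoints:
        layer = build_layer()                      # allocate the layer structure on CPU
        layer.load_state_dict(torch.load(path))    # load this layer's weights from disk
        layer.to("cuda")                           # move only this layer to the GPU
        with torch.no_grad():
            hidden_states = layer(hidden_states)   # compute; activations stay on the GPU
        del layer                                  # drop the weights before the next layer
        torch.cuda.empty_cache()                   # peak VRAM ~ one layer plus activations
    return hidden_states
```

In the real library, prefetching the next layer's weights from disk while the current layer computes hides much of the loading latency this naive loop would incur.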
Quick Start & Requirements
```bash
pip install airllm
```
bitsandbytes is required only if you enable 4-bit or 8-bit block-wise compression.
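A minimal usage sketch, assuming the `AutoModel` interface described in the project's documentation; the model ID, `MAX_LENGTH`, and generation settings here are illustrative, not prescriptive.

```python
from airllm import AutoModel

MAX_LENGTH = 128
# Any supported Hugging Face checkpoint; this 70B model ID is only an example.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
# With bitsandbytes installed, compression can be requested at load time, e.g.
# AutoModel.from_pretrained(..., compression="4bit")  # or "8bit", per the docs

input_tokens = model.tokenizer(
    ["What is the capital of the United States?"],
    return_tensors="pt",
    truncation=True,
    max_length=MAX_LENGTH,
)
output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(output.sequences[0]))
```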
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The initial model decomposition is disk-intensive and requires sufficient free space to store the per-layer split. Some models need extra handling (for example, a Hugging Face token for gated models, or an explicitly set padding token), as sketched below. CPU inference support was added only recently.
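A sketch of that extra handling, assuming the `hf_token` parameter and tokenizer attributes follow the project's examples and standard Hugging Face conventions; the model ID and token value are placeholders.

```python
from airllm import AutoModel

# Gated checkpoints (e.g. the Llama family) require an access token; "hf_xxx" is a placeholder.
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    hf_token="hf_xxx",
)

# Some tokenizers ship without a padding token; reusing EOS is a common workaround.
if model.tokenizer.pad_token is None:
    model.tokenizer.pad_token = model.tokenizer.eos_token
```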