airllm by lyogavin

Inference optimization for LLMs on low-resource hardware

Created 2 years ago
5,920 stars

Top 8.7% on SourcePulse

Project Summary

AirLLM enables running large language models, including 70B-parameter models, on consumer hardware with as little as 4GB of VRAM, without requiring quantization or distillation. It targets researchers and power users who need to deploy LLMs in resource-constrained environments, offering significant memory savings during inference.

How It Works

AirLLM achieves its low memory footprint by decomposing a model into its layers and loading each layer from disk only when it is needed for computation, freeing it afterward. This "layer-wise" splitting bounds peak memory by roughly the size of a single layer rather than the whole model, and optional block-wise quantization of weights (4-bit or 8-bit) shrinks it further. Prefetching overlaps layer loading with computation to improve inference speed.
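The layer-wise idea can be illustrated with a minimal sketch. This is not AirLLM's actual code (the helper names below are invented for illustration): each "layer" is persisted to its own file, then loaded, applied, and freed one at a time, so peak memory holds only a single layer's weights.

```python
# Sketch of layer-wise inference: weights live on disk, and only one
# layer is resident in memory at any point during the forward pass.
import os
import pickle
import tempfile

def save_layers(layers, directory):
    """Write each layer's weights to its own file, as a layer-split model would."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths

def matvec(matrix, vector):
    """Plain dense matrix-vector product (stand-in for a real layer's math)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def layerwise_forward(layer_paths, activations):
    """Run a forward pass, loading one layer at a time from disk."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)   # load only this layer
        activations = matvec(weights, activations)
        del weights                    # free it before loading the next layer
    return activations

if __name__ == "__main__":
    # Two tiny 2x2 "layers": an identity matrix, then a doubling matrix.
    layers = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
    ]
    with tempfile.TemporaryDirectory() as d:
        paths = save_layers(layers, d)
        print(layerwise_forward(paths, [3.0, 4.0]))  # prints [6.0, 8.0]
```

In the real library the per-layer loads are overlapped with GPU computation via prefetching, which hides much of the disk latency this naive loop would incur.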

Quick Start & Requirements

  • Install via pip: pip install airllm
  • Requires PyTorch; bitsandbytes is needed for the optional compression.
  • Supports macOS (Apple Silicon) via MLX and PyTorch.
  • See Quick start and Example notebooks.

Highlighted Details

  • Supports Llama3.1 405B on 8GB VRAM.
  • Natively supports Llama3, Qwen2.5, Mixtral, ChatGLM, Baichuan, Mistral, and InternLM.
  • Offers up to 3x inference speed-up with block-wise quantization.
  • Can delete original model files to save disk space.
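The block-wise quantization behind the quoted speed-up can be sketched in a few lines. This is a simplified illustration, not AirLLM's implementation (which relies on bitsandbytes): each block of weights stores one absmax scale plus small signed integers.

```python
# Simplified block-wise quantization: per block, keep an absmax scale
# and integers in [-2**(bits-1), 2**(bits-1) - 1] (illustration only).

def quantize_blockwise(values, block_size=4, bits=4):
    """Quantize floats block by block with a per-block absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0   # avoid division by zero
        ints = [round(v / scale * qmax) for v in block]
        blocks.append((scale, ints))
    return blocks

def dequantize_blockwise(blocks, bits=4):
    """Reconstruct approximate floats from (scale, ints) blocks."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for scale, ints in blocks:
        out.extend(q * scale / qmax for q in ints)
    return out

if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.07, 1.5, -0.02, 0.9, -1.1]
    restored = dequantize_blockwise(quantize_blockwise(weights))
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max reconstruction error: {max_err:.3f}")
```

Per-block scales keep the quantization error small even when weight magnitudes vary widely across a tensor, which is why block-wise schemes lose little accuracy at 4 bits; the speed-up comes from moving far fewer bytes from disk to GPU per layer.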

Maintenance & Community

  • Active development with frequent updates (e.g., Qwen2.5 support, CPU inference).
  • Open to contributions.

Licensing & Compatibility

  • The repository does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The initial model decomposition is disk-intensive and requires enough free space to hold the split layers alongside the original weights. Some models need specific handling (e.g., providing a Hugging Face token for gated models, or setting a padding token). CPU inference support was added only recently and may be less mature.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 30 days
