airllm  by lyogavin

Inference optimization for LLMs on low-resource hardware

created 2 years ago
5,883 stars

Top 8.9% on sourcepulse

Project Summary

AirLLM enables running large language models, including 70B-parameter models, on consumer hardware with as little as 4GB of VRAM, without requiring quantization or distillation. It targets researchers and power users who need to deploy LLMs in resource-constrained environments, offering significant memory savings during inference.

How It Works

AirLLM achieves its low memory footprint by decomposing models into layers and loading them dynamically as needed. This "layer-wise" splitting, combined with optional block-wise quantization of weights (4-bit or 8-bit), reduces the peak memory requirement. Prefetching is also employed to overlap model loading with computation, improving inference speed.
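The layer-wise pattern can be sketched in a few lines of plain Python. This is a toy illustration of the loading strategy described above, not AirLLM's actual code: each layer's weights live on disk and are loaded only while that layer runs, so peak memory is roughly one layer instead of the whole model.

```python
# Toy sketch of layer-wise inference: weights are persisted per layer and
# loaded from disk on demand, then released before the next layer loads.
import os
import pickle
import tempfile

def save_layers(layers, directory):
    """Persist each layer's weights to its own file (the 'split' step)."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths

def run_layerwise(x, layer_paths):
    """Apply layers one at a time, loading each from disk just before use."""
    for path in layer_paths:
        with open(path, "rb") as f:
            scale, bias = pickle.load(f)   # only this layer is in memory
        x = [scale * v + bias for v in x]  # stand-in for a real forward pass
        # scale/bias go out of scope here, freeing the layer before the next
    return x

with tempfile.TemporaryDirectory() as d:
    layers = [(2.0, 1.0), (0.5, 0.0), (1.0, -1.0)]  # toy per-layer weights
    paths = save_layers(layers, d)
    print(run_layerwise([1.0, 2.0], paths))  # → [0.5, 1.5]
```

AirLLM additionally prefetches the next layer while the current one computes; the sketch omits that overlap for clarity.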

Quick Start & Requirements

  • Install via pip: pip install airllm
  • Requires PyTorch; bitsandbytes is needed for block-wise compression.
  • Supports macOS (Apple Silicon) via MLX as well as PyTorch.
  • See the Quick start guide and example notebooks in the repository.
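A minimal usage sketch, based on the API names shown in the project's README (`AutoModel.from_pretrained`, `model.tokenizer`, `model.generate`); the model id is an arbitrary example and a first call downloads and splits the full model, so the function is defined but not invoked here:

```python
def generate_sample(prompt: str) -> str:
    """Load a model layer-by-layer via AirLLM and generate a short reply.

    Illustrative only: API names follow the AirLLM README; the model id
    is a placeholder, and the first run downloads and splits the weights.
    """
    from airllm import AutoModel

    model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-70B-Instruct")
    tokens = model.tokenizer([prompt], return_tensors="pt")
    output = model.generate(tokens["input_ids"], max_new_tokens=20)
    return model.tokenizer.decode(output)

# Not executed here: generate_sample("What is the capital of France?")
# would trigger the (disk-intensive) download-and-split step on first use.
```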

Highlighted Details

  • Supports Llama3.1 405B on 8GB VRAM.
  • Natively supports Llama3, Qwen2.5, Mixtral, ChatGLM, Baichuan, Mistral, and InternLM.
  • Offers up to 3x inference speed-up with block-wise quantization.
  • Can optionally delete the original downloaded model files after splitting to save disk space.
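The block-wise quantization mentioned above is easy to sketch. This is an illustrative pure-Python version (not the bitsandbytes kernels AirLLM uses): weights are split into fixed-size blocks, each stored as 8-bit integer codes plus one float scale, shrinking memory roughly 4x versus float32.

```python
# Illustrative block-wise 8-bit quantization with one scale per block.
BLOCK = 4  # block size; real implementations use e.g. 64 or 128

def quantize_blockwise(weights):
    """Return (codes, scales): one int8 code per value, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 127 or 1.0  # avoid zero scale
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)  # codes in [-127, 127]
    return codes, scales

def dequantize_blockwise(codes, scales):
    """Reconstruct approximate float weights from codes and per-block scales."""
    return [code * scales[i // BLOCK] for i, code in enumerate(codes)]

w = [0.1, -0.5, 0.25, 0.0, 2.0, -1.0, 0.5, 1.5]
codes, scales = quantize_blockwise(w)
approx = dequantize_blockwise(codes, scales)
print(max(abs(a - b) for a, b in zip(w, approx)))  # small reconstruction error
```

Per-block scales keep the quantization error bounded by each block's local range, which is why block-wise schemes lose so little accuracy compared with a single global scale.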

Maintenance & Community

  • Active development with frequent updates (e.g., Qwen2.5 support, CPU inference).
  • Open to contributions.

Licensing & Compatibility

  • The repository does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.

Limitations & Caveats

The initial model decomposition is disk-intensive and requires enough free space for the split layers. Some models need extra handling (e.g., a Hugging Face token for gated models, or setting a padding token). CPU inference support is recent and may be less mature than GPU inference.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 150 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), Julien Chaumond (cofounder of Hugging Face), and 1 more.

parallelformers by tunib-ai

790 stars. Toolkit for easy model parallelization.
Created 4 years ago, updated 2 years ago.