mixtral-offloading by dvmazur

Inference optimization for Mixtral-8x7B models

Created 1 year ago
2,323 stars

Top 19.6% on SourcePulse

Project Summary

This project enables efficient inference of Mixtral-8x7B models on consumer hardware, such as a Google Colab instance or a desktop GPU, by offloading model experts between GPU and CPU memory. It targets researchers and developers who need to run large language models in resource-constrained environments.

How It Works

The core approach combines mixed quantization via HQQ with a Mixture-of-Experts (MoE) offloading strategy. Different quantization schemes are applied to attention layers and to experts to minimize the memory footprint. Experts are offloaded individually and moved back to the GPU only when required; an LRU cache of active experts reduces GPU-CPU transfers during activation computation, as sketched below.
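
To make the caching idea concrete, here is a minimal, self-contained PyTorch sketch of an LRU expert cache. This is not the repository's implementation; the class and parameter names are hypothetical.

```python
from collections import OrderedDict

import torch.nn as nn


class ExpertLRUCache:
    """Keep the most recently used experts on the GPU; evict the least
    recently used expert back to CPU when the cache is full.
    (Hypothetical names; not the repository's actual classes.)"""

    def __init__(self, cpu_experts: dict[int, nn.Module], capacity: int):
        self.cpu_experts = cpu_experts   # expert_id -> CPU-resident module
        self.capacity = capacity         # max experts resident on the GPU
        self.gpu_experts: OrderedDict[int, nn.Module] = OrderedDict()

    def get(self, expert_id: int) -> nn.Module:
        if expert_id in self.gpu_experts:
            # Cache hit: mark as most recently used; no transfer needed.
            self.gpu_experts.move_to_end(expert_id)
            return self.gpu_experts[expert_id]
        if len(self.gpu_experts) >= self.capacity:
            # Evict the least recently used expert back to CPU memory.
            evicted_id, evicted = self.gpu_experts.popitem(last=False)
            self.cpu_experts[evicted_id] = evicted.to("cpu")
        # Cache miss: copy the requested expert's weights onto the GPU.
        expert = self.cpu_experts.pop(expert_id).to("cuda", non_blocking=True)
        self.gpu_experts[expert_id] = expert
        return expert
```

Only the experts chosen by the router for the current token pay a transfer cost, and repeated activations of the same expert are served from GPU memory.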

Quick Start & Requirements

  • Run the demo notebook: ./notebooks/demo.ipynb
  • No command-line script is currently available; the demo notebook serves as the reference for local setup (a hedged outline of its flow is sketched after this list).
  • Requires Python and standard ML libraries (specific versions are not pinned in the README).
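
For orientation, here is a rough outline of the flow the demo notebook follows. Everything below is an assumption drawn from the project description rather than the project's confirmed API; the notebook itself is the authoritative reference, and the checkpoint repo id is a placeholder.

```python
# Hedged outline of the flow ./notebooks/demo.ipynb follows; treat every
# step below as an assumption rather than the project's confirmed API.
from huggingface_hub import snapshot_download

# 1. Download a pre-quantized Mixtral-8x7B checkpoint from the Hugging Face
#    Hub. The repo id is a placeholder; substitute the one referenced in
#    the demo notebook.
state_path = snapshot_download(repo_id="<pre-quantized-mixtral-8x7b-repo>")

# 2. The notebook then builds the model from these weights with mixed HQQ
#    quantization plus an offloading config (how many experts per layer
#    stay GPU-resident), and
# 3. runs generation with a standard Hugging Face tokenizer.
# See demo.ipynb for the concrete classes and parameters.
```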

Highlighted Details

  • Efficient inference of Mixtral-8x7B on consumer hardware.
  • Combines HQQ mixed quantization with MoE offloading (see the sketch after this list).
  • Utilizes an LRU cache for active experts to optimize GPU-CPU communication.
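
To illustrate the mixed-quantization bullet, below is a sketch using the hqq library's public API (BaseQuantizeConfig, HQQLinear). The specific bit widths and group sizes are assumptions for illustration, not the project's documented settings; verify the call signatures against your installed hqq version.

```python
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Assumed bit widths / group sizes for illustration, not the project's
# documented settings: more bits for the always-resident attention
# projections, fewer bits for the large, offloaded expert weights.
attn_cfg = BaseQuantizeConfig(nbits=4, group_size=64)
expert_cfg = BaseQuantizeConfig(nbits=2, group_size=16)

# Stand-in layers with Mixtral-like shapes.
attn_proj = nn.Linear(4096, 4096, bias=False)
expert_proj = nn.Linear(4096, 14336, bias=False)

# Each full-precision nn.Linear is replaced by its HQQ-quantized version.
q_attn = HQQLinear(attn_proj, quant_config=attn_cfg,
                   compute_dtype=torch.float16, device="cuda")
q_expert = HQQLinear(expert_proj, quant_config=expert_cfg,
                     compute_dtype=torch.float16, device="cuda")
```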

Maintenance & Community

  • The roadmap lists support for additional quantization methods and speculative expert prefetching, though the repository has seen no commits in the past year (see Health Check below).
  • Contributions are welcome.

Licensing & Compatibility

  • License type not specified in the README.
  • Compatibility for commercial or closed-source use is not detailed.

Limitations & Caveats

Some techniques described in the project's technical report are not yet implemented in the repository. The project is a work in progress, and a command-line interface for local execution is not yet available.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history (30d): 8 stars

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

  • MoE model for research
  • Top 0.2% on SourcePulse · 462 stars
  • Created 4 months ago · Updated 4 weeks ago

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

  • LLM inference engine for blazing fast performance
  • Top 0.3% on SourcePulse · 6k stars
  • Created 1 year ago · Updated 1 day ago

Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

  • Inference optimization for LLMs on low-resource hardware
  • Top 0.1% on SourcePulse · 6k stars
  • Created 2 years ago · Updated 2 weeks ago

Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

  • Framework for LLM inference optimization experimentation
  • Top 0.3% on SourcePulse · 15k stars
  • Created 1 year ago · Updated 2 days ago