Inference optimization for Mixtral-8x7B models
Top 20.1% on sourcepulse
This project enables efficient inference of Mixtral-8x7B models on consumer hardware, such as Google Colab or desktop GPUs, by offloading model experts between GPU and CPU memory. It targets researchers and developers who need to run large language models in resource-constrained environments.
How It Works
The core approach combines mixed quantization using HQQ with a Mixture-of-Experts (MoE) offloading strategy. Different quantization schemes are applied to the attention layers and the experts to minimize the memory footprint. Experts are offloaded individually and brought back to the GPU only when required; an LRU cache of recently active experts is kept on the GPU to reduce GPU-CPU communication during activation computation, as sketched below.
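The offloading side of this scheme can be illustrated with a short sketch. The code below is a minimal, hypothetical illustration assuming plain PyTorch nn.Module experts; ExpertCache, its capacity argument, and the toy routing loop are made-up names for clarity and do not correspond to the project's actual implementation.

```python
# Minimal sketch of per-expert LRU offloading between CPU and GPU.
from collections import OrderedDict

import torch
import torch.nn as nn


class ExpertCache:
    """Keeps at most `capacity` experts resident on the GPU.

    All experts start on the CPU. An expert is moved to the GPU only when the
    router selects it, and the least recently used expert is moved back to the
    CPU once the cache is full.
    """

    def __init__(self, experts, capacity, device="cuda"):
        self.experts = experts            # full expert list (initially on CPU)
        self.capacity = capacity          # max number of GPU-resident experts
        self.device = device
        self.resident = OrderedDict()     # expert_id -> module, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark this expert as most recently used.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]

        # Cache miss: evict the least recently used expert back to the CPU.
        if len(self.resident) >= self.capacity:
            old_id, old_expert = self.resident.popitem(last=False)
            old_expert.to("cpu")          # nn.Module.to moves parameters in place

        # Bring the requested expert onto the GPU and record it as resident.
        expert = self.experts[expert_id].to(self.device)
        self.resident[expert_id] = expert
        return expert


if __name__ == "__main__":
    # Toy demo: 8 small expert layers, at most 2 kept on the accelerator at once.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    experts = [nn.Linear(16, 16) for _ in range(8)]
    cache = ExpertCache(experts, capacity=2, device=device)

    x = torch.randn(1, 16, device=device)
    for expert_id in (3, 5, 3, 7):        # simulated router decisions
        x = cache.get(expert_id)(x)       # only the chosen expert occupies GPU memory
    print(x.shape)
```

The LRU policy pays off when nearby tokens tend to route to the same experts, so a small GPU-resident cache can serve most activations without transferring weights on every token.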
Quick Start & Requirements
To try it out, open and run the demo notebook at ./notebooks/demo.ipynb, which is intended for consumer hardware such as a Colab GPU instance.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Some techniques described in the project's technical report are not yet implemented in the repository. The project is a work in progress, and a command-line interface for local execution is not yet available.