MagicPIG by Infini-AI-Lab

Efficient LLM generation via LSH sampling

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Project Summary

MagicPIG addresses the challenge of efficient Large Language Model (LLM) generation by introducing Locality-Sensitive Hashing (LSH) sampling. This technique enables a hybrid GPU-CPU system, significantly boosting decoding throughput and improving downstream task accuracy compared to GPU-only approaches. It is designed for researchers and power users seeking to optimize LLM inference performance and explore novel hardware utilization strategies.

How It Works

The project leverages LSH sampling to approximate the attention mechanism in LLMs, drastically reducing the computational load: instead of attending to every cached key, each query attends only to the keys whose LSH hash codes collide with its own. MagicPIG offloads the hash-table lookups and the resulting sparse attention computation to the CPU, creating a synergistic GPU-CPU architecture. This approach is advantageous as it minimizes the need for extensive GPU VRAM and achieves higher accuracy on retrieval and reasoning tasks than state-of-the-art baselines like Quest, even at a fraction of the computational cost.
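The idea can be illustrated with a minimal NumPy sketch (a hypothetical illustration, not the project's implementation): a SimHash-style LSH assigns sign-bit codes to the query and cached keys, and attention is computed only over the keys whose codes largely agree with the query's.

```python
import numpy as np

def simhash(vecs, planes):
    """Sign-bit hash codes from random hyperplanes -> (n, n_planes) bools."""
    return (vecs @ planes.T) > 0

def lsh_sample_attention(q, keys, values, n_planes=8, min_agree=6, seed=0):
    """Toy LSH-sampled attention: attend only to keys whose SimHash code
    agrees with the query's code in at least `min_agree` of `n_planes` bits."""
    rng = np.random.default_rng(seed)
    d = q.shape[0]
    planes = rng.standard_normal((n_planes, d))
    q_code = simhash(q[None, :], planes)[0]        # (n_planes,)
    k_codes = simhash(keys, planes)                # (n_keys, n_planes)
    agree = (k_codes == q_code).sum(axis=1)
    idx = np.nonzero(agree >= min_agree)[0]        # sampled key indices
    if idx.size == 0:                              # degenerate case: keep all
        idx = np.arange(keys.shape[0])
    scores = keys[idx] @ q / np.sqrt(d)            # softmax over sampled keys
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx], idx
```

Raising `min_agree` makes the sample sparser (fewer keys, cheaper attention) at the cost of recall; the real system additionally corrects the softmax for the sampling probabilities, which this sketch omits.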

Quick Start & Requirements

  • Primary install / run command: create and activate a Conda environment (conda create -n magicpig, conda activate magicpig), then run bash install.sh.
  • Non-default prerequisites and dependencies: Basic functionality requires an Intel CPU with AVX512. BFloat16 support additionally requires AVX512_BF16 and GCC ≥ 11. Recommended Python versions are 3.9/3.10. Model support is currently limited to Llama (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct).
  • Links: [Paper] and [Blog] are provided.
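The install steps above amount to the following (the python=3.10 pin is an assumption based on the recommended versions, not taken verbatim from the README):

```shell
# Create and activate the environment (Python 3.9/3.10 recommended)
conda create -n magicpig python=3.10
conda activate magicpig

# Build and install dependencies
# (requires an Intel CPU with AVX512; GCC >= 11 for BFloat16 support)
bash install.sh
```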

Highlighted Details

  • Achieves 1.76-4.99x improvement in decoding throughput over GPU-only attention.
  • Demonstrates higher downstream accuracy in retrieval and reasoning tasks than the Quest baseline.
  • Recent updates (December 2024) include integration with FlashInfer for GPU attention, enhanced CPU sparse attention, and optimized hash table construction.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels were found in the provided README snippet.

Licensing & Compatibility

No explicit license information or compatibility notes for commercial use were found in the provided README snippet.

Limitations & Caveats

The core performance benefits are tied to specific hardware, requiring Intel CPUs with AVX512 support. Current model support is restricted to Llama architectures. While accuracy can be evaluated on non-AVX512 systems via equivalent implementations, latency and throughput benchmarks are hardware-dependent.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago; updated 11 months ago