MagicPIG by Infini-AI-Lab

Efficient LLM generation via LSH sampling

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Project Summary

MagicPIG addresses the challenge of efficient Large Language Model (LLM) generation by introducing Locality-Sensitive Hashing (LSH) sampling. This technique enables a hybrid GPU-CPU system, significantly boosting decoding throughput and improving downstream task accuracy compared to GPU-only approaches. It is designed for researchers and power users seeking to optimize LLM inference performance and explore novel hardware utilization strategies.

How It Works

The project leverages LSH sampling to approximate the attention mechanism in LLMs, drastically reducing the computational load: instead of attending to every cached key, each query attends only to the keys whose LSH hash codes collide with its own. MagicPIG offloads the hash-table lookups and the resulting sparse attention computation to the CPU, creating a synergistic GPU-CPU architecture. This approach is advantageous as it minimizes the need for extensive GPU VRAM and achieves higher accuracy on retrieval and reasoning tasks than state-of-the-art baselines like Quest, even at a fraction of the computational cost.
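The idea can be illustrated with a minimal NumPy sketch (a hypothetical illustration, not the project's implementation): a SimHash-style LSH assigns sign-bit codes to the query and cached keys, and attention is computed only over the keys whose codes largely agree with the query's.

```python
import numpy as np

def simhash(vecs, planes):
    """Sign-bit hash codes from random hyperplanes -> (n, n_planes) bools."""
    return (vecs @ planes.T) > 0

def lsh_sample_attention(q, keys, values, n_planes=8, min_agree=6, seed=0):
    """Toy LSH-sampled attention: attend only to keys whose SimHash code
    agrees with the query's code in at least `min_agree` of `n_planes` bits."""
    rng = np.random.default_rng(seed)
    d = q.shape[0]
    planes = rng.standard_normal((n_planes, d))
    q_code = simhash(q[None, :], planes)[0]        # (n_planes,)
    k_codes = simhash(keys, planes)                # (n_keys, n_planes)
    agree = (k_codes == q_code).sum(axis=1)
    idx = np.nonzero(agree >= min_agree)[0]        # sampled key indices
    if idx.size == 0:                              # degenerate case: keep all
        idx = np.arange(keys.shape[0])
    scores = keys[idx] @ q / np.sqrt(d)            # softmax over sampled keys
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values[idx], idx
```

Raising `min_agree` makes the sample sparser (fewer keys, cheaper attention) at the cost of recall; the real system additionally corrects the softmax for the sampling probabilities, which this sketch omits.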

Quick Start & Requirements

  • Primary install / run command: create and activate a Conda environment (conda create -n magicpig, conda activate magicpig), then run bash install.sh.
  • Non-default prerequisites and dependencies: Basic functionality requires an Intel CPU with AVX512. BFloat16 support additionally requires AVX512_BF16 and GCC ≥ 11. Recommended Python versions are 3.9/3.10. Model support is currently limited to Llama (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct).
  • Links: [Paper] and [Blog] are provided.
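The install steps above amount to the following (the python=3.10 pin is an assumption based on the recommended versions, not taken verbatim from the README):

```shell
# Create and activate the environment (Python 3.9/3.10 recommended)
conda create -n magicpig python=3.10
conda activate magicpig

# Build and install dependencies
# (requires an Intel CPU with AVX512; GCC >= 11 for BFloat16 support)
bash install.sh
```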

Highlighted Details

  • Achieves 1.76-4.99x improvement in decoding throughput over GPU-only attention.
  • Demonstrates higher downstream accuracy in retrieval and reasoning tasks than the Quest baseline.
  • Recent updates (December 2024) include integration with FlashInfer for GPU attention, enhanced CPU sparse attention, and optimized hash table construction.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels were found in the provided README snippet.

Licensing & Compatibility

No explicit license information or compatibility notes for commercial use were found in the provided README snippet.

Limitations & Caveats

The core performance benefits are tied to specific hardware, requiring Intel CPUs with AVX512 support. Current model support is restricted to Llama architectures. While accuracy can be evaluated on non-AVX512 systems via equivalent implementations, latency and throughput benchmarks are hardware-dependent.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago; updated 11 months ago