PowerInfer by SJTU-IPADS

LLM inference engine for local deployment on consumer GPUs

created 1 year ago
8,280 stars

Top 6.4% on sourcepulse

Project Summary

PowerInfer is a high-speed LLM inference engine designed for local deployment on consumer-grade hardware, targeting researchers and power users who want efficient local LLM serving. It accelerates inference by exploiting activation locality, bringing performance on a single consumer GPU close to that of server-grade GPUs.

How It Works

PowerInfer uses a CPU-GPU hybrid design based on the observation that LLM inference exhibits activation locality: a small subset of "hot" neurons is activated consistently across inputs, while the remaining "cold" neurons activate infrequently and input-dependently. Hot neurons are preloaded onto the GPU for fast access, while cold neurons are computed on the CPU, reducing both GPU memory requirements and CPU-GPU data transfers. The engine also integrates adaptive predictors, which forecast per input which cold neurons will activate, and neuron-aware sparse operators, which skip computation for neurons predicted inactive.
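
The hot/cold split can be illustrated with a toy NumPy sketch. This is a hypothetical illustration, not PowerInfer's actual C++ implementation: the profiling distribution, threshold, and low-rank predictor are all stand-ins.

    # Toy sketch of activation-locality-based neuron placement (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n_neurons, d_model, rank = 1024, 256, 16

    # Offline profiling: activation frequencies are heavily skewed, so a small
    # "hot" subset of FFN neurons accounts for most activations.
    freq = rng.beta(0.5, 4.0, n_neurons)
    hot = freq > 0.5                      # preloaded onto the GPU
    cold_idx = np.flatnonzero(~hot)       # left in CPU memory

    W = rng.standard_normal((n_neurons, d_model)).astype(np.float32)
    W_hot, W_cold = W[hot], W[cold_idx]   # static weight placement

    # Stand-in for the adaptive predictor: a tiny low-rank model that guesses
    # which cold neurons will produce nonzero ReLU output for this input.
    A = (rng.standard_normal((len(cold_idx), rank)) / rank).astype(np.float32)
    B = rng.standard_normal((rank, d_model)).astype(np.float32)

    def ffn_relu(x):
        y = np.zeros(n_neurons, dtype=np.float32)
        y[hot] = np.maximum(W_hot @ x, 0.0)            # dense "GPU" path (hot neurons)
        guess = (A @ (B @ x)) > 0.0                    # predict active cold neurons
        y[cold_idx[guess]] = np.maximum(W_cold[guess] @ x, 0.0)  # sparse "CPU" path
        return y                                       # predicted-inactive neurons skipped

    x = rng.standard_normal(d_model).astype(np.float32)
    y = ffn_relu(x)
    print(f"hot: {int(hot.sum())}, cold computed: {int(((A @ (B @ x)) > 0).sum())} of {len(cold_idx)}")

Because ReLU-family activations drive most neuron outputs to exactly zero, neurons predicted inactive can be skipped at little accuracy cost; this is also why supported models are limited to ReLU/ReGLU/Squared ReLU variants (see Limitations & Caveats below).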

Quick Start & Requirements

  • Installation: Clone the repository, install the Python dependencies (pip install -r requirements.txt), and build with CMake. For NVIDIA GPUs, configure with cmake -S . -B build -DLLAMA_CUBLAS=ON; for AMD GPUs, use cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 (adjust the target to your GPU architecture). Then compile with cmake --build build --config Release.
  • Prerequisites: CMake (3.17+), Python (3.8+), and pip. An NVIDIA GPU with CUDA or an AMD GPU with ROCm is recommended for best performance.
  • Model Weights: Download PowerInfer GGUF-format weights from Hugging Face (e.g., PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF); see the snippet after this list for one way to script the download.
  • Demo: A demo video is available: https://github.com/SJTU-IPADS/PowerInfer/assets/34213478/fe441a42-5fce-448b-a3e5-ea4abb43ba23
  • Documentation: https://github.com/SJTU-IPADS/PowerInfer
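
As one way to script the weight download mentioned above, the snippet below uses the huggingface_hub Python package (an assumption for illustration; any method of fetching the repository works, and the local directory name is arbitrary):

    # Fetch PowerInfer GGUF weights from Hugging Face.
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",
        local_dir="./ReluLLaMA-7B-PowerInfer-GGUF",  # arbitrary local path
    )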

Highlighted Details

  • Achieves up to 11x speedup over llama.cpp on a single RTX 4090.
  • Supports models with ReLU/ReGLU/Squared ReLU activation functions.
  • Offers INT4 quantization support.
  • Recent updates include support for AMD GPUs (ROCm) and Windows GPU inference.
  • PowerInfer-2 framework for smartphones achieves 11.68 tokens/sec with TurboSparse-Mixtral-47B.

Maintenance & Community

  • Active development with a public Kanban board tracking progress.
  • ROCm/HIP (AMD GPU) support was developed in connection with a community competition.
  • Acknowledges contributions from ggml and THUNLP.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The project acknowledges ggml and llama.cpp, which are typically under permissive licenses (e.g., MIT), but commercial use or closed-source linking should not be assumed without explicit license confirmation.

Limitations & Caveats

  • Currently supports only models with ReLU/ReGLU/Squared ReLU activation functions, which excludes popular models such as Mistral and the original LLaMA.
  • Performance degradation noted for the 70B model due to insufficient fine-tuning data.
  • Metal backend for macOS is listed as a future feature.
Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 4
  • Star history: 135 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

  • Top 2.1% · 3k stars
  • High-performance 4-bit diffusion model inference engine
  • created 8 months ago · updated 14 hours ago
  • Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

  • Top 0.4% · 4k stars
  • High-performance C++ LLM inference library
  • created 2 years ago · updated 2 weeks ago
  • Starred by Bojan Tunguz (AI scientist; formerly at NVIDIA), Mckay Wrigley (founder of Takeoff AI), and 8 more.

ggml by ggml-org

  • Top 0.3% · 13k stars
  • Tensor library for machine learning
  • created 2 years ago · updated 3 days ago