PowerInfer by SJTU-IPADS

LLM inference engine for local deployment on consumer GPUs

created 1 year ago
8,280 stars

Top 6.4% on sourcepulse

Project Summary

PowerInfer is a high-speed LLM inference engine designed for local deployment on consumer-grade hardware, targeting researchers and power users who want efficient local LLM serving. It accelerates inference by exploiting activation locality, bringing performance on a single consumer GPU close to that of server-grade GPUs.

How It Works

PowerInfer uses a CPU-GPU hybrid design based on the observation that LLM inference exhibits activation locality: a small subset of "hot" neurons is activated consistently across inputs, while the remaining "cold" neurons activate infrequently and input-dependently. Hot neurons are preloaded onto the GPU for fast access, while cold neurons are computed on the CPU, reducing both GPU memory requirements and CPU-GPU data transfers. The engine also integrates adaptive predictors, which forecast per input which cold neurons will activate, and neuron-aware sparse operators, which skip computation for neurons predicted inactive.
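
The hot/cold split can be illustrated with a toy NumPy sketch. This is a hypothetical illustration, not PowerInfer's actual C++ implementation: the profiling distribution, threshold, and low-rank predictor are all stand-ins.

    # Toy sketch of activation-locality-based neuron placement (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n_neurons, d_model, rank = 1024, 256, 16

    # Offline profiling: activation frequencies are heavily skewed, so a small
    # "hot" subset of FFN neurons accounts for most activations.
    freq = rng.beta(0.5, 4.0, n_neurons)
    hot = freq > 0.5                      # preloaded onto the GPU
    cold_idx = np.flatnonzero(~hot)       # left in CPU memory

    W = rng.standard_normal((n_neurons, d_model)).astype(np.float32)
    W_hot, W_cold = W[hot], W[cold_idx]   # static weight placement

    # Stand-in for the adaptive predictor: a tiny low-rank model that guesses
    # which cold neurons will produce nonzero ReLU output for this input.
    A = (rng.standard_normal((len(cold_idx), rank)) / rank).astype(np.float32)
    B = rng.standard_normal((rank, d_model)).astype(np.float32)

    def ffn_relu(x):
        y = np.zeros(n_neurons, dtype=np.float32)
        y[hot] = np.maximum(W_hot @ x, 0.0)            # dense "GPU" path (hot neurons)
        guess = (A @ (B @ x)) > 0.0                    # predict active cold neurons
        y[cold_idx[guess]] = np.maximum(W_cold[guess] @ x, 0.0)  # sparse "CPU" path
        return y                                       # predicted-inactive neurons skipped

    x = rng.standard_normal(d_model).astype(np.float32)
    y = ffn_relu(x)
    print(f"hot: {int(hot.sum())}, cold computed: {int(((A @ (B @ x)) > 0).sum())} of {len(cold_idx)}")

Because ReLU-family activations drive most neuron outputs to exactly zero, neurons predicted inactive can be skipped at little accuracy cost; this is also why supported models are limited to ReLU/ReGLU/Squared ReLU variants (see Limitations & Caveats below).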

Quick Start & Requirements

  • Installation: Clone the repository, install the Python dependencies (pip install -r requirements.txt), and build with CMake. For NVIDIA GPUs, configure with cmake -S . -B build -DLLAMA_CUBLAS=ON; for AMD GPUs, use cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 (adjust the target to your GPU architecture). Then compile with cmake --build build --config Release.
  • Prerequisites: CMake (3.17+), Python (3.8+), and pip. An NVIDIA GPU with CUDA or an AMD GPU with ROCm is recommended for best performance.
  • Model Weights: Download PowerInfer GGUF-format weights from Hugging Face (e.g., PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF); see the snippet after this list for one way to script the download.
  • Demo: A demo video is available: https://github.com/SJTU-IPADS/PowerInfer/assets/34213478/fe441a42-5fce-448b-a3e5-ea4abb43ba23
  • Documentation: https://github.com/SJTU-IPADS/PowerInfer
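
As one way to script the weight download mentioned above, the snippet below uses the huggingface_hub Python package (an assumption for illustration; any method of fetching the repository works, and the local directory name is arbitrary):

    # Fetch PowerInfer GGUF weights from Hugging Face.
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF",
        local_dir="./ReluLLaMA-7B-PowerInfer-GGUF",  # arbitrary local path
    )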

Highlighted Details

  • Achieves up to 11x speedup over llama.cpp on a single RTX 4090.
  • Supports models with ReLU/ReGLU/Squared ReLU activation functions.
  • Offers INT4 quantization support.
  • Recent updates include support for AMD GPUs (ROCm) and Windows GPU inference.
  • PowerInfer-2 framework for smartphones achieves 11.68 tokens/sec with TurboSparse-Mixtral-47B.

Maintenance & Community

  • Active development with a public Kanban board tracking progress.
  • ROCm/HIP (AMD GPU) support was developed in connection with a community competition.
  • Acknowledges contributions from ggml and THUNLP.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The project acknowledges ggml and llama.cpp, which are typically under permissive licenses (e.g., MIT), but commercial use or closed-source linking should not be assumed without explicit license confirmation.

Limitations & Caveats

  • Currently supports only models with ReLU/ReGLU/Squared ReLU activation functions, which excludes popular models such as Mistral and the original LLaMA.
  • Performance degradation noted for the 70B model due to insufficient fine-tuning data.
  • Metal backend for macOS is listed as a future feature.
Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 4
  • Star history: 135 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

  • Top 2.1% · 3k stars
  • High-performance 4-bit diffusion model inference engine
  • created 8 months ago · updated 14 hours ago
  • Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16

  • Top 0.4% · 4k stars
  • High-performance C++ LLM inference library
  • created 2 years ago · updated 2 weeks ago
  • Starred by Bojan Tunguz (AI scientist; formerly at NVIDIA), Mckay Wrigley (founder of Takeoff AI), and 8 more.

ggml by ggml-org

  • Top 0.3% · 13k stars
  • Tensor library for machine learning
  • created 2 years ago · updated 3 days ago