LLM inference engine for local deployment on consumer GPUs
Top 6.4% on sourcepulse
PowerInfer is a high-speed LLM inference engine designed for local deployment on consumer-grade hardware, aimed at researchers and power users who need efficient LLM serving. By exploiting activation locality, it significantly accelerates inference, bringing performance on a single consumer GPU close to that of server-grade GPUs.
How It Works
PowerInfer uses a CPU-GPU hybrid approach based on the observation that LLM inference exhibits activation locality: a small subset of "hot" neurons is consistently activated across inputs. Hot neurons are preloaded onto the GPU for fast access, while the less frequently activated "cold" neurons are computed on the CPU. This split reduces both GPU memory requirements and CPU-GPU data transfers. The engine further incorporates adaptive predictors, which estimate per input which neurons will fire, and neuron-aware sparse operators that skip the inactive ones, turning activation sparsity into real compute savings.
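To make the hot/cold split concrete, here is a minimal Python sketch of the routing idea. The frequency-based partitioning, the `predictor` callable, and the NumPy stand-ins for GPU/CPU kernels are illustrative assumptions, not PowerInfer's actual API.

```python
import numpy as np

def split_neurons(activation_counts, gpu_budget):
    """Partition neuron indices by observed activation frequency.

    The most frequently firing ("hot") neurons go to the GPU, up to
    its memory budget; the long tail ("cold") stays on the CPU.
    """
    order = np.argsort(activation_counts)[::-1]  # most active first
    return order[:gpu_budget], order[gpu_budget:]

def ffn_forward(x, W, hot, cold, predictor):
    """Sparse FFN layer: compute only rows the predictor expects to fire."""
    active = predictor(x)  # predicted-active neuron indices for this input
    y = np.zeros(W.shape[0])
    hot_active = np.intersect1d(hot, active)
    cold_active = np.intersect1d(cold, active)
    # In PowerInfer the hot rows live in GPU memory and the cold rows in
    # host memory; here both "devices" are simulated with NumPy.
    y[hot_active] = W[hot_active] @ x    # would run on the GPU
    y[cold_active] = W[cold_active] @ x  # would run on the CPU
    return np.maximum(y, 0.0)            # ReLU keeps activations sparse
```

Because only predicted-active rows are multiplied, the cost of the dense matmul drops roughly in proportion to the layer's activation sparsity, and the hot/cold placement keeps most of the remaining work on the GPU.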
Quick Start & Requirements
Clone the repository, install the Python dependencies (`pip install -r requirements.txt`), and build using CMake. For NVIDIA GPUs, configure with `cmake -S . -B build -DLLAMA_CUBLAS=ON`; for AMD GPUs, use `cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100`; then compile with `cmake --build build --config Release`. Download model weights in PowerInfer's GGUF format (e.g., `PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF`).
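Inference then runs through the llama.cpp-style `main` binary that PowerInfer inherits. A typical invocation might look like the following; the model path, token count, thread count, and prompt are placeholders:

```bash
# Generate 128 tokens with 8 CPU threads from a downloaded PowerInfer GGUF model.
./build/bin/main -m /PATH/TO/MODEL.gguf -n 128 -t 8 -p "Once upon a time"
```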
Highlighted Details
Reported speedups of up to 11.69x over `llama.cpp` on a single RTX 4090, per the project's own benchmarks.
Maintenance & Community
Licensing & Compatibility
PowerInfer builds on `ggml` and `llama.cpp`, which are typically under permissive licenses (e.g., MIT). Compatibility for commercial use or closed-source linking would require explicit license confirmation.
Limitations & Caveats