pegainfer by xiaguan

Pure Rust + CUDA LLM inference engine

Created 1 month ago
272 stars

Top 94.7% on SourcePulse

Project Summary

Pegainfer is a high-performance LLM inference engine built from scratch using pure Rust and custom CUDA kernels, deliberately avoiding frameworks like PyTorch or ONNX. It targets engineers and power users seeking deep control over the inference stack, offering a minimal dependency footprint and clear understanding of each operational layer. The project explores a Rust-native inference solution, delivering raw performance via hand-optimized GPU code.

How It Works

Pegainfer comprises roughly 7K lines of Rust and 3.4K lines of hand-written CUDA kernels. Its core design executes all computation on the GPU, with no CPU fallbacks. Custom CUDA kernels handle most operations, with cuBLAS handling GEMM. Optimizations include fused operators (attention, MLPs), BF16 storage with FP32 accumulation for numerical stability, and CUDA Graphs during decode to minimize kernel-launch overhead. Triton is used only at build time, for ahead-of-time (AOT) compilation of specific kernels (e.g., silu_mul), generating C wrappers that the runtime links against.
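To illustrate the BF16-storage/FP32-accumulation pattern, here is a minimal CPU-side Rust sketch. It is not the project's actual kernel code (the real work happens in CUDA on the GPU), and the function names are hypothetical; BF16 is modeled as the upper 16 bits of an `f32`.

```rust
// Hypothetical CPU sketch of BF16 storage with FP32 accumulation.
// The real engine performs this inside CUDA kernels.

/// Convert an f32 to bf16 using round-to-nearest-even on the upper 16 bits.
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1); // round-to-nearest-even
    ((bits + rounding) >> 16) as u16
}

/// Widen a bf16 back to f32 (exact: bf16 is a truncated f32).
fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Dot product over bf16-stored vectors, accumulated in f32.
/// Storing in bf16 halves memory traffic; accumulating in f32
/// keeps rounding error from compounding across long reductions.
fn dot_bf16_f32_acc(a: &[u16], b: &[u16]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| bf16_to_f32(x) * bf16_to_f32(y))
        .sum::<f32>() // f32 accumulator
}

fn main() {
    let a: Vec<u16> = (0..4).map(|i| f32_to_bf16(i as f32 * 0.5)).collect();
    let b: Vec<u16> = vec![f32_to_bf16(1.0); 4];
    // a holds bf16 encodings of [0.0, 0.5, 1.0, 1.5]; dot with ones is 3.0
    println!("{}", dot_bf16_f32_acc(&a, &b));
}
```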

Quick Start & Requirements

  • Prerequisites: Rust (2024 edition), CUDA Toolkit (nvcc, cuBLAS), CUDA-capable GPU, Python 3 with Triton for build-time AOT.
  • Installation: Set up Python venv (e.g., uv), install dependencies (torch, transformers, accelerate, pytest), download model weights (e.g., Qwen3-4B) via huggingface-cli.
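The installation steps above might look like the following. This is a sketch, not the project's documented procedure: the exact package list, the Hugging Face repo id (`Qwen/Qwen3-4B`), and the target directory are assumptions, so defer to the repository's README.

```shell
# Hypothetical setup sketch; exact commands may differ from the repo's docs.

# Create and activate a Python virtual environment with uv
uv venv && source .venv/bin/activate

# Build-time dependencies (Triton is needed for AOT kernel compilation)
uv pip install torch transformers accelerate pytest triton

# Download model weights (example: Qwen3-4B) from Hugging Face
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
```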
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 51
  • Issues (30d): 4
Star History: 85 stars in the last 30 days
