Pegainfer (by xiaguan): Pure Rust + CUDA LLM inference engine
Top 94.7% on SourcePulse
Summary
Pegainfer is a high-performance LLM inference engine built from scratch using pure Rust and custom CUDA kernels, deliberately avoiding frameworks like PyTorch or ONNX. It targets engineers and power users seeking deep control over the inference stack, offering a minimal dependency footprint and clear understanding of each operational layer. The project explores a Rust-native inference solution, delivering raw performance via hand-optimized GPU code.
How It Works
Pegainfer comprises roughly 7K lines of Rust and 3.4K lines of hand-written CUDA kernels. Its core design executes all computation on the GPU, with no CPU fallbacks. Custom CUDA kernels handle most operations, with cuBLAS used for GEMM. Optimizations include fused operators (attention, MLPs), BF16 storage with FP32 accumulation for numerical stability, and CUDA Graphs during decode to minimize kernel launch overhead. Triton is used solely at build time for Ahead-of-Time (AOT) compilation of specific kernels (e.g., silu_mul), generating C wrappers for use at runtime.
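The BF16-storage / FP32-accumulation pattern can be sketched on the CPU in plain Rust (an illustrative sketch, not Pegainfer's actual kernel code): BF16 is simply the upper 16 bits of an IEEE-754 f32, so values are stored as `u16` and widened back to `f32` before accumulating.

```rust
/// Convert f32 to BF16 (stored as u16) with round-to-nearest-even.
/// BF16 keeps the f32 sign and exponent but truncates the mantissa to 7 bits.
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    // Round-to-nearest-even on the 16 mantissa bits being dropped.
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    ((bits.wrapping_add(rounding)) >> 16) as u16
}

/// Widen BF16 back to f32: place the 16 stored bits in the high half.
fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Dot product over BF16-stored vectors, accumulated in full f32 precision.
/// Storing in BF16 halves memory traffic; accumulating in f32 limits
/// the rounding error that pure-BF16 accumulation would introduce.
fn dot_bf16(a: &[u16], b: &[u16]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| bf16_to_f32(x) * bf16_to_f32(y))
        .sum::<f32>()
}

fn main() {
    // a = [0.0, 0.5, 1.0, 1.5], b = [2.0; 4]; all values exact in BF16.
    let a: Vec<u16> = (0..4).map(|i| f32_to_bf16(i as f32 * 0.5)).collect();
    let b: Vec<u16> = vec![f32_to_bf16(2.0); 4];
    println!("{}", dot_bf16(&a, &b)); // prints "6"
}
```

On the GPU the same idea applies per-thread inside a fused kernel: loads are 16-bit, but the running accumulator register is a 32-bit float.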
Quick Start & Requirements
Set up a Python environment (e.g., with uv), install dependencies (torch, transformers, accelerate, pytest), and download model weights (e.g., Qwen3-4B) via huggingface-cli.
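The setup steps above might look like the following (an illustrative sketch; the package list and model come from the steps above, while the environment and download paths are hypothetical):

```shell
# Create and activate an isolated Python environment with uv
uv venv .venv
source .venv/bin/activate

# Install the Python-side dependencies listed above
uv pip install torch transformers accelerate pytest

# Download model weights with huggingface-cli (ships with huggingface_hub)
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
```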