pegainfer by xiaguan

Pure Rust + CUDA LLM inference engine

Created 1 month ago
272 stars

Top 94.7% on SourcePulse

Project Summary

Pegainfer is a high-performance LLM inference engine built from scratch using pure Rust and custom CUDA kernels, deliberately avoiding frameworks like PyTorch or ONNX. It targets engineers and power users seeking deep control over the inference stack, offering a minimal dependency footprint and clear understanding of each operational layer. The project explores a Rust-native inference solution, delivering raw performance via hand-optimized GPU code.

How It Works

Pegainfer comprises roughly 7K lines of Rust and 3.4K lines of hand-written CUDA kernels. Its core design executes all computation on the GPU, with no CPU fallbacks. Custom CUDA kernels handle most operations, with cuBLAS handling GEMM. Optimizations include fused operators (attention, MLPs), BF16 storage with FP32 accumulation for numerical stability, and CUDA Graphs during decode to minimize kernel-launch overhead. Triton is used only at build time, for ahead-of-time (AOT) compilation of specific kernels (e.g., silu_mul), generating C wrappers that the runtime links against.
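To illustrate the BF16-storage/FP32-accumulation pattern, here is a minimal CPU-side Rust sketch. It is not the project's actual kernel code (the real work happens in CUDA on the GPU), and the function names are hypothetical; BF16 is modeled as the upper 16 bits of an `f32`.

```rust
// Hypothetical CPU sketch of BF16 storage with FP32 accumulation.
// The real engine performs this inside CUDA kernels.

/// Convert an f32 to bf16 using round-to-nearest-even on the upper 16 bits.
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1); // round-to-nearest-even
    ((bits + rounding) >> 16) as u16
}

/// Widen a bf16 back to f32 (exact: bf16 is a truncated f32).
fn bf16_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Dot product over bf16-stored vectors, accumulated in f32.
/// Storing in bf16 halves memory traffic; accumulating in f32
/// keeps rounding error from compounding across long reductions.
fn dot_bf16_f32_acc(a: &[u16], b: &[u16]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| bf16_to_f32(x) * bf16_to_f32(y))
        .sum::<f32>() // f32 accumulator
}

fn main() {
    let a: Vec<u16> = (0..4).map(|i| f32_to_bf16(i as f32 * 0.5)).collect();
    let b: Vec<u16> = vec![f32_to_bf16(1.0); 4];
    // a holds bf16 encodings of [0.0, 0.5, 1.0, 1.5]; dot with ones is 3.0
    println!("{}", dot_bf16_f32_acc(&a, &b));
}
```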

Quick Start & Requirements

  • Prerequisites: Rust (2024 edition), CUDA Toolkit (nvcc, cuBLAS), CUDA-capable GPU, Python 3 with Triton for build-time AOT.
  • Installation: Set up Python venv (e.g., uv), install dependencies (torch, transformers, accelerate, pytest), download model weights (e.g., Qwen3-4B) via huggingface-cli.
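The installation steps above might look like the following. This is a sketch, not the project's documented procedure: the exact package list, the Hugging Face repo id (`Qwen/Qwen3-4B`), and the target directory are assumptions, so defer to the repository's README.

```shell
# Hypothetical setup sketch; exact commands may differ from the repo's docs.

# Create and activate a Python virtual environment with uv
uv venv && source .venv/bin/activate

# Build-time dependencies (Triton is needed for AOT kernel compilation)
uv pip install torch transformers accelerate pytest triton

# Download model weights (example: Qwen3-4B) from Hugging Face
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
```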
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 51
  • Issues (30d): 4
Star History: 85 stars in the last 30 days
