rvllm by m0at

High-performance LLM inference engine in Rust

Created 2 weeks ago

414 stars

Top 70.7% on SourcePulse

Summary

rvLLM is a high-performance LLM inference engine written from scratch in Rust and designed as a drop-in replacement for vLLM. It targets engineers and power users who want better resource efficiency, faster startup, and smaller deployment footprints than Python-based alternatives provide.

How It Works

The core innovation lies in its pure Rust implementation, eliminating Python's overhead (GIL, GC, interpreter). rvLLM employs a novel Rust-native PTX compiler that generates fused GPU kernels at model load time, achieving 2-7.5x speedups over hand-written CUDA for specific operations. It features an FA3 v3 attention mechanism with cp.async and split-KV for long contexts, alongside CUDA graph replay and cuBLAS autotuning for optimized execution.
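To make the load-time compilation idea concrete, here is a minimal, purely illustrative Rust sketch of a kernel cache that compiles each fused-op signature once when the model loads and serves cached results at decode time. All names (`KernelCache`, `get_or_compile`, the signature strings) are hypothetical; the real compiler emits and loads actual PTX rather than placeholder strings.

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a load-time JIT kernel cache:
/// fused-op signatures are compiled once, at model load,
/// instead of being dispatched through an interpreter per token.
struct KernelCache {
    compiled: HashMap<String, String>, // op signature -> "PTX" text
    compilations: usize,               // how many real compiles happened
}

impl KernelCache {
    fn new() -> Self {
        Self { compiled: HashMap::new(), compilations: 0 }
    }

    // Placeholder for PTX generation; in practice this would emit
    // and JIT-load fused GPU code specialized to the op's shapes.
    fn compile(&mut self, signature: &str) -> String {
        self.compilations += 1;
        format!("// ptx for {signature}")
    }

    /// Return the kernel for `signature`, compiling at most once.
    fn get_or_compile(&mut self, signature: &str) -> &str {
        if !self.compiled.contains_key(signature) {
            let ptx = self.compile(signature);
            self.compiled.insert(signature.to_string(), ptx);
        }
        &self.compiled[signature]
    }
}

fn main() {
    let mut cache = KernelCache::new();
    // "Model load": compile fused kernels for every op the graph needs.
    for sig in ["rmsnorm+matmul:f16:4096", "silu+mul:f16:11008"] {
        cache.get_or_compile(sig);
    }
    // Decode-time lookups hit the cache; nothing is recompiled.
    cache.get_or_compile("rmsnorm+matmul:f16:4096");
    assert_eq!(cache.compilations, 2);
    println!("compiled {} kernels once at load time", cache.compilations);
}
```

The point of the sketch is the cost model: compilation cost is paid once per signature at load, so steady-state decode pays only a hash lookup, which is where the claimed speedups over interpreter-dispatched execution come from.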

Quick Start & Requirements

  • Install: cargo install rvllm or pip install rvllm. Source build requires cargo build --release --features cuda.
  • Prerequisites: A CUDA-enabled GPU is essential. Building from source requires the Rust toolchain.
  • Docs: Key architectural and benchmark details are available in docs/arch.md, docs/benchmark-history.md, and docs/cutlass-epilogue-spec.md.
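The documented install paths can be run as follows. The commands are taken verbatim from the summary above; serving flags and model arguments are not documented here, so only installation is shown.

```shell
# Option 1: install the prebuilt crate via cargo
cargo install rvllm

# Option 2: install via pip
pip install rvllm

# Option 3: build from source
# (requires the Rust toolchain and a CUDA-enabled GPU)
cargo build --release --features cuda
```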

Highlighted Details

  • Achieves 12,312 tok/s at 128 concurrent streams (0.85x vLLM's direct-engine throughput).
  • Offers 20x faster cold start (6 s vs ~120 s) and a 31x smaller binary (16 MB vs ~500 MB).
  • Features JIT-compiled fused kernels that are 2-7.5x faster than hand-written CUDA for single-token decode.
  • Uses 3x less CPU memory and shows a 5.6x tighter P95 latency spread, thanks to the absence of Python overhead.

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking cannot be determined without this information.

Limitations & Caveats

rvLLM exhibits performance gaps compared to vLLM, particularly in HTTP throughput (0.67-0.88x) and direct engine throughput (0.82-0.96x), primarily due to differences in GEMM tuning and attention kernel optimizations. Its scheduler is less mature than vLLM's. Quantization support is limited to FP8 weights, whereas vLLM supports a wider range of formats. Speculative decoding is experimental and shows limited benefit on smaller models.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 18
  • Issues (30d): 20

Star History

  • 418 stars in the last 15 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

  • 1.6% | 7k stars
  • LLM inference engine for blazing fast performance
  • Created 2 years ago; updated 1 day ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 61 more.

vllm by vllm-project

  • 1.2% | 76k stars
  • LLM serving engine for high-throughput, memory-efficient inference
  • Created 3 years ago; updated 20 hours ago