rvllm  by m0at

High-performance LLM inference engine in Rust

Created 2 months ago
723 stars

Top 47.1% on SourcePulse

Project Summary

rvLLM is a high-performance LLM inference engine written from scratch in Rust and designed as a drop-in replacement for vLLM. It targets engineers and power users who want lower resource usage, faster startup, and a smaller deployment footprint than Python-based alternatives provide.

How It Works

The core design choice is a pure Rust implementation, which removes Python's runtime overhead (GIL, garbage collection, interpreter). rvLLM includes a Rust-native PTX compiler that generates fused GPU kernels at model load time; for specific operations in single-token decode, these kernels are reported to be 2-7.5x faster than hand-written CUDA. Its FA3 v3 attention kernel uses cp.async and split-KV to handle long contexts, and execution is further optimized with CUDA graph replay and cuBLAS autotuning.
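The split-KV idea mentioned above can be illustrated on the CPU. The following is a minimal sketch of the general technique, not rvLLM's kernel: the KV cache for one query is cut into chunks, each chunk produces a partial softmax result (running max, denominator, unnormalized output), and the partials are merged with log-sum-exp rescaling. All names here (`Partial`, `attend_chunk`, `merge`, `split_kv_attention`) are hypothetical and chosen only for illustration; the sketch assumes a non-empty KV cache and a chunk size of at least 1.

```rust
/// Partial attention result for one KV chunk (hypothetical type for illustration).
#[derive(Debug, Clone)]
struct Partial {
    max: f32,      // largest scaled score seen in the chunk
    denom: f32,    // sum of exp(score - max) over the chunk
    acc: Vec<f32>, // sum of exp(score - max) * value, not yet normalized
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Attention of a single query over one chunk of keys and values.
fn attend_chunk(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Partial {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];
    for (s, v) in scores.iter().zip(values) {
        let w = (*s - max).exp();
        denom += w;
        for (a, x) in acc.iter_mut().zip(v) {
            *a += w * *x;
        }
    }
    Partial { max, denom, acc }
}

/// Merge chunk partials: rescale each to the global max, then normalize once.
fn merge(parts: &[Partial]) -> Vec<f32> {
    let global_max = parts.iter().map(|p| p.max).fold(f32::NEG_INFINITY, f32::max);
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; parts[0].acc.len()];
    for p in parts {
        let rescale = (p.max - global_max).exp();
        denom += p.denom * rescale;
        for (a, x) in acc.iter_mut().zip(&p.acc) {
            *a += *x * rescale;
        }
    }
    acc.iter().map(|a| *a / denom).collect()
}

/// Split the KV cache into chunks of `chunk` entries and merge the partials.
/// On a GPU each chunk could be handled by a separate thread block; here the
/// chunks are simply processed one after another.
fn split_kv_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    chunk: usize,
) -> Vec<f32> {
    let parts: Vec<Partial> = keys
        .chunks(chunk)
        .zip(values.chunks(chunk))
        .map(|(k, v)| attend_chunk(q, k, v))
        .collect();
    merge(&parts)
}
```

Calling `split_kv_attention` with `chunk` equal to the full cache length gives ordinary single-pass attention, and any smaller chunk size should produce the same output up to floating-point rounding. That equivalence is what lets the work for a long context be spread across many parallel units.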

Quick Start & Requirements

  • Install: cargo install rvllm or pip install rvllm. Source build requires cargo build --release --features cuda.
  • Prerequisites: A CUDA-enabled GPU is essential. Building from source requires the Rust toolchain.
  • Docs: Key architectural and benchmark details are available in docs/arch.md, docs/benchmark-history.md, and docs/cutlass-epilogue-spec.md.

Highlighted Details

  • Achieves 12,312 tok/s at 128 concurrent streams (0.85x the throughput of vLLM's direct engine).
  • Offers a 20x faster cold start (6s vs ~120s) and a 31x smaller binary (16 MB vs ~500 MB).
  • Features JIT-compiled fused kernels that are 2-7.5x faster than hand-written CUDA for single-token decode.
  • Uses 3x less CPU memory and shows a 5.6x tighter P95 latency spread, attributed to the absence of Python overhead.
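The headline ratios above follow directly from the quoted measurements. As a small arithmetic sketch (the helper names `improvement_factor` and `implied_baseline` are hypothetical, and the inputs are only the figures quoted in the list), the factors can be recomputed like this:

```rust
/// How many times smaller (or faster) `improved` is relative to `baseline`.
fn improvement_factor(baseline: f64, improved: f64) -> f64 {
    baseline / improved
}

/// Baseline throughput implied by a measured throughput and its ratio to the baseline.
fn implied_baseline(measured: f64, ratio: f64) -> f64 {
    measured / ratio
}
```

With the quoted numbers, `improvement_factor(120.0, 6.0)` gives the 20x cold-start figure and `improvement_factor(500.0, 16.0)` gives 31.25, which the summary rounds to 31x. `implied_baseline(12312.0, 0.85)` suggests that the vLLM direct engine reached roughly 14,500 tok/s in the same test; that value is derived from the ratio, not stated in the source.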

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking cannot be determined without this information.

Limitations & Caveats

rvLLM trails vLLM in throughput: 0.67-0.88x over HTTP and 0.82-0.96x for the direct engine, primarily due to differences in GEMM tuning and attention kernel optimizations. Its scheduler is less mature than vLLM's. Quantization support is limited to FP8 weights, whereas vLLM supports a wider range of formats. Speculative decoding is experimental and shows limited benefit on smaller models.

Health Check

  • Last Commit: 3 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 0
  • Star History: 34 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.3%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 3 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 62 more.

vllm by vllm-project

0.7%
81k
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 9 hours ago