rvllm by m0at

High-performance LLM inference engine in Rust

Created 3 months ago

754 stars

Top 45.4% on SourcePulse

3 Experts Love This Project

Author of Bend, Kind, HVM

Project Summary

rvLLM is an LLM inference engine written from scratch in Rust and designed as a drop-in replacement for vLLM. It targets engineers and power users who want lower resource usage, faster startup, and a smaller deployment footprint than Python-based alternatives, at the cost of somewhat lower peak throughput than vLLM.

How It Works

The core design choice is a pure Rust implementation, which removes Python's runtime overhead (GIL, GC, interpreter). rvLLM includes a Rust-native PTX compiler that generates fused GPU kernels at model load time; these are reported to be 2-7.5x faster than hand-written CUDA for single-token decode. It also features an FA3 v3 attention mechanism with cp.async and split-KV for long contexts, alongside CUDA graph replay and cuBLAS autotuning.
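To make the split-KV idea concrete, here is a minimal CPU-side sketch in Rust of the general technique: attention for one query is computed independently over chunks of the key/value cache, and the partial results are merged with a numerically stable softmax rescaling. This is not taken from rvLLM's source; the names `Partial`, `attend_chunk`, `merge`, and `split_kv_attention` are hypothetical, and the real kernels run on the GPU.

```rust
/// Partial attention result for one KV chunk (hypothetical type):
/// the chunk's max score, its softmax denominator, and the
/// unnormalized weighted sum of its value vectors.
struct Partial {
    max: f32,
    sum: f32,
    acc: Vec<f32>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled dot-product attention of one query over a single chunk.
fn attend_chunk(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Partial {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    let mut acc = vec![0.0; values[0].len()];
    for (s, v) in scores.iter().zip(values) {
        let w = (s - max).exp(); // subtract the chunk max for stability
        sum += w;
        for (a, x) in acc.iter_mut().zip(v) {
            *a += w * x;
        }
    }
    Partial { max, sum, acc }
}

/// Merge per-chunk partials by rescaling each to the global max score.
fn merge(parts: &[Partial]) -> Vec<f32> {
    let max = parts.iter().map(|p| p.max).fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    let mut out = vec![0.0; parts[0].acc.len()];
    for p in parts {
        let r = (p.max - max).exp();
        sum += p.sum * r;
        for (o, a) in out.iter_mut().zip(&p.acc) {
            *o += a * r;
        }
    }
    out.iter().map(|o| o / sum).collect()
}

/// Split the KV sequence into chunks of `chunk` entries (must be > 0),
/// attend to each chunk independently, then merge.
fn split_kv_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    chunk: usize,
) -> Vec<f32> {
    let parts: Vec<Partial> = keys
        .chunks(chunk)
        .zip(values.chunks(chunk))
        .map(|(k, v)| attend_chunk(q, k, v))
        .collect();
    merge(&parts)
}

fn main() {
    let q = vec![1.0, 0.5];
    let keys = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0], vec![-1.0, 2.0]];
    let values = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0], vec![7.0, 8.0]];
    let full = split_kv_attention(&q, &keys, &values, keys.len());
    let split = split_kv_attention(&q, &keys, &values, 2);
    println!("single chunk: {:?}", full);
    println!("two chunks:   {:?}", split);
}
```

The two printed vectors should agree up to floating-point rounding, which is the property that lets a long context be processed as independent chunks and combined afterwards.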

Quick Start & Requirements

Install: cargo install rvllm or pip install rvllm. Source build requires cargo build --release --features cuda.
Prerequisites: A CUDA-enabled GPU is essential. Building from source requires the Rust toolchain.
Docs: Key architectural and benchmark details are available in docs/arch.md, docs/benchmark-history.md, and docs/cutlass-epilogue-spec.md.

Highlighted Details

Achieves 12,312 tok/s at 128 concurrent streams (0.85x vLLM direct engine).
Offers 20x faster cold start (6s vs ~120s) and a 31x smaller binary size (16 MB vs ~500 MB).
Features JIT-compiled fused kernels that are 2-7.5x faster than hand-written CUDA for single-token decode.
Uses 3x less CPU memory and shows a 5.6x tighter P95 latency spread, attributed to the absence of Python overhead.
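As a quick sanity check on the figures above, the following Rust sketch recomputes the quoted ratios from the raw numbers in this list. The helper name `ratio` is hypothetical, and the derived vLLM throughput and per-stream rate are only implied by the listed figures, not numbers stated in the README.

```rust
/// How many times larger `baseline` is than `candidate`.
fn ratio(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

fn main() {
    // Cold start: ~120 s (vLLM) vs 6 s (rvLLM).
    println!("cold start: {:.1}x faster", ratio(120.0, 6.0));
    // Binary size: ~500 MB vs 16 MB.
    println!("binary size: {:.1}x smaller", ratio(500.0, 16.0));
    // 12,312 tok/s is quoted as 0.85x of vLLM's direct engine,
    // which implies a vLLM figure of 12,312 / 0.85 tok/s.
    println!("implied vLLM throughput: {:.0} tok/s", 12_312.0 / 0.85);
    // Average per-stream rate at 128 concurrent streams.
    println!("per-stream rate: {:.1} tok/s", 12_312.0 / 128.0);
}
```

The first two ratios come out to 20 and 31.25, matching the rounded 20x and 31x claims.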

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (like Discord/Slack), or a public roadmap.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking cannot be determined without this information.

Limitations & Caveats

rvLLM exhibits performance gaps compared to vLLM, particularly in HTTP throughput (0.67-0.88x) and direct engine throughput (0.82-0.96x), primarily due to differences in GEMM tuning and attention kernel optimizations. Its scheduler is less mature than vLLM's. Quantization support is limited to FP8 weights, whereas vLLM supports a wider range of formats. Speculative decoding is experimental and shows limited benefit on smaller models.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

19 stars in the last 30 days