nano-vllm by GeeeekExplorer

Lightweight vLLM implementation from scratch

Created 3 months ago
6,685 stars

Top 7.7% on SourcePulse

Project Summary

Nano-vLLM is a lightweight, from-scratch implementation of vLLM for fast offline inference of large language models. Aimed at developers and researchers who want a more accessible and understandable LLM inference engine, it delivers speeds comparable to vLLM's with a significantly smaller codebase.

How It Works

Nano-vLLM applies a suite of optimizations, including prefix caching, tensor parallelism, Torch compilation, and CUDA graphs, to achieve high inference throughput. Its design prioritizes readability: at roughly 1,200 lines of Python, the codebase is easier to understand, modify, and extend than larger inference frameworks.
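The prefix-caching idea can be illustrated with a small, self-contained sketch (not nano-vllm's actual code; block size and key scheme are assumptions for illustration): prompts are split into fixed-size token blocks, each block is keyed by a hash of the full token prefix ending at that block, and blocks already seen are served from cache instead of having their KV tensors recomputed.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block; real engines use larger blocks


def block_key(prefix_tokens):
    """Key a block by the entire token prefix ending at that block,
    so a cache hit guarantees identical preceding context."""
    return sha256(str(prefix_tokens).encode()).hexdigest()


class PrefixCache:
    def __init__(self):
        self.cache = {}  # block key -> stand-in for cached KV tensors
        self.hits = 0
        self.misses = 0

    def process(self, tokens):
        """Walk the prompt block by block, reusing cached blocks."""
        usable = len(tokens) - len(tokens) % BLOCK_SIZE  # only full blocks
        for start in range(0, usable, BLOCK_SIZE):
            key = block_key(tokens[: start + BLOCK_SIZE])
            if key in self.cache:
                self.hits += 1  # KV for this block was already computed
            else:
                self.misses += 1
                self.cache[key] = f"kv@{start}"  # placeholder for KV data


cache = PrefixCache()
cache.process([1, 2, 3, 4, 5, 6, 7, 8])      # first prompt: all misses
cache.process([1, 2, 3, 4, 9, 10, 11, 12])   # shares a 4-token prefix: one hit
```

Keying each block by the whole prefix (rather than the block's own tokens) is what makes the scheme safe: a block is only reused when everything before it matched too, so the cached KV state is valid.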

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
  • Requires model weights to be downloaded separately (e.g., using huggingface-cli).
  • Example usage is provided in example.py; the API mirrors vLLM's interface.

Highlighted Details

  • Benchmarked on an RTX 4070 Laptop GPU (8GB VRAM) with Qwen3-0.6B, achieving 1,434 tokens/s and slightly outperforming vLLM's 1,361 tokens/s in a specific test configuration.
  • Implements key optimizations: prefix caching, tensor parallelism, Torch compilation, CUDA graphs.
  • Codebase is approximately 1,200 lines of Python.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The unspecified license may pose a barrier to commercial adoption. The README also does not detail compatibility with hardware beyond the benchmarked RTX 4070 Laptop GPU, nor which CUDA versions are supported.

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 7
Issues (30d): 7
Star History: 825 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 7 more.

gemma.cpp by google
Top 0.1% · 7k stars
C++ inference engine for Google's Gemma models
Created 1 year ago · Updated 1 day ago