nano-vllm by GeeeekExplorer

Lightweight vLLM implementation from scratch

Created 3 months ago
6,685 stars

Top 7.7% on SourcePulse

Project Summary

Nano-vLLM is a lightweight, from-scratch implementation of vLLM for fast offline inference of large language models. Aimed at developers and researchers who want a more accessible and understandable LLM inference engine, it delivers speeds comparable to vLLM's with a significantly smaller codebase.

How It Works

Nano-vLLM applies a suite of optimizations, including prefix caching, tensor parallelism, Torch compilation, and CUDA graphs, to achieve high inference throughput. Its design prioritizes readability: at roughly 1,200 lines of Python, the codebase is easier to understand, modify, and extend than larger inference frameworks.
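The prefix-caching idea can be illustrated with a small, self-contained sketch (not nano-vllm's actual code; block size and key scheme are assumptions for illustration): prompts are split into fixed-size token blocks, each block is keyed by a hash of the full token prefix ending at that block, and blocks already seen are served from cache instead of having their KV tensors recomputed.

```python
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per cache block; real engines use larger blocks


def block_key(prefix_tokens):
    """Key a block by the entire token prefix ending at that block,
    so a cache hit guarantees identical preceding context."""
    return sha256(str(prefix_tokens).encode()).hexdigest()


class PrefixCache:
    def __init__(self):
        self.cache = {}  # block key -> stand-in for cached KV tensors
        self.hits = 0
        self.misses = 0

    def process(self, tokens):
        """Walk the prompt block by block, reusing cached blocks."""
        usable = len(tokens) - len(tokens) % BLOCK_SIZE  # only full blocks
        for start in range(0, usable, BLOCK_SIZE):
            key = block_key(tokens[: start + BLOCK_SIZE])
            if key in self.cache:
                self.hits += 1  # KV for this block was already computed
            else:
                self.misses += 1
                self.cache[key] = f"kv@{start}"  # placeholder for KV data


cache = PrefixCache()
cache.process([1, 2, 3, 4, 5, 6, 7, 8])      # first prompt: all misses
cache.process([1, 2, 3, 4, 9, 10, 11, 12])   # shares a 4-token prefix: one hit
```

Keying each block by the whole prefix (rather than the block's own tokens) is what makes the scheme safe: a block is only reused when everything before it matched too, so the cached KV state is valid.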

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
  • Requires model weights to be downloaded separately (e.g., using huggingface-cli).
  • Example usage is provided in example.py; the API mirrors vLLM's interface.

Highlighted Details

  • Benchmarked on an RTX 4070 Laptop GPU (8GB VRAM) with Qwen3-0.6B, achieving 1,434 tokens/s and slightly outperforming vLLM's 1,361 tokens/s in a specific test configuration.
  • Implements key optimizations: prefix caching, tensor parallelism, Torch compilation, CUDA graphs.
  • Codebase is approximately 1,200 lines of Python.

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The unspecified license may pose a barrier to commercial adoption. The README also does not detail compatibility with hardware beyond the benchmarked RTX 4070 Laptop GPU, nor which CUDA versions are supported.

Health Check

Last Commit: 2 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 7
Issues (30d): 7
Star History: 825 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 7 more.

gemma.cpp by google
Top 0.1% · 7k stars
C++ inference engine for Google's Gemma models
Created 1 year ago · Updated 1 day ago