nano-vllm  by GeeeekExplorer

Lightweight vLLM implementation from scratch

created 1 month ago
5,542 stars

Top 9.3% on sourcepulse

Project Summary

Nano-vLLM offers a lightweight, from-scratch implementation of vLLM for fast offline inference of large language models. Targeting developers and researchers seeking a more accessible and understandable LLM inference engine, it provides comparable speeds to vLLM with a significantly smaller codebase.

How It Works

Nano-vLLM leverages a suite of optimizations, including prefix caching, tensor parallelism, Torch compilation, and CUDA graphs, to achieve high inference throughput. Its design prioritizes readability: the entire engine is roughly 1,200 lines of Python, making it easier to understand, modify, and extend than larger inference frameworks.
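To give a flavor of the first of those optimizations, prefix caching can be sketched as a hash-keyed block cache: completed token blocks are keyed by a hash of the full prefix, so requests that share a prompt prefix reuse cached KV entries instead of recomputing them. This is a toy illustration, not Nano-vLLM's actual code; the block size and payloads are made up.

```python
import hashlib

BLOCK_SIZE = 4  # toy block size; real engines use larger KV-cache blocks


class PrefixCache:
    """Toy prefix cache: full token blocks are keyed by a hash of the
    prefix up to and including that block, so two prompts that share a
    prefix hit the same cached blocks."""

    def __init__(self):
        self.blocks = {}  # prefix hash -> cached "KV" payload (stand-in string)

    def _block_hash(self, tokens, end):
        # Hash the whole prefix, not just the block, so a block only
        # matches when everything before it matches too.
        return hashlib.sha256(str(tokens[:end]).encode("utf-8")).hexdigest()

    def insert(self, tokens):
        """Cache every full block of the prompt."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self.blocks.setdefault(
                self._block_hash(tokens, end), f"kv[{end - BLOCK_SIZE}:{end}]"
            )

    def lookup(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        cached = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if self._block_hash(tokens, end) not in self.blocks:
                break
            cached = end
        return cached


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])           # first request fills the cache
hit = cache.lookup([1, 2, 3, 4, 5, 6, 99, 100])  # second request shares one full block
print(hit)  # 4: only the first block's prefix matches
```

In a real engine the payload is the attention KV tensor for that block, and prefill can skip straight to the first uncached token.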

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/GeeeekExplorer/nano-vllm.git
  • Requires model weights to be downloaded separately (e.g., using huggingface-cli).
  • Example usage is provided in example.py; the API mirrors vLLM's interface.
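Given the README's note that the API mirrors vLLM's, a minimal usage sketch might look like the following. This is an assumption-laden illustration, not verified against example.py: the model path is a placeholder, and running it requires a CUDA-capable GPU plus weights downloaded separately.

```python
# Hypothetical usage sketch assuming a vLLM-style interface.
# "/path/to/Qwen3-0.6B" is a placeholder for locally downloaded weights.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/Qwen3-0.6B", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Hello, Nano-vLLM."], sampling_params)
print(outputs[0]["text"])
```

Consult example.py in the repository for the authoritative interface.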

Highlighted Details

  • Benchmarked on an RTX 4070 Laptop (8GB VRAM) with Qwen3-0.6B, achieving 1434 tokens/s, slightly outperforming vLLM's 1361 tokens/s in a specific test configuration.
  • Implements key optimizations: prefix caching, tensor parallelism, Torch compilation, CUDA graphs.
  • Codebase is approximately 1,200 lines of Python.
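As a quick sanity check on the headline benchmark, the two reported throughputs imply roughly a 5% edge for Nano-vLLM in that single run (figures taken from the bullet above):

```python
# Throughputs reported for the RTX 4070 Laptop / Qwen3-0.6B benchmark (tokens/s).
nano_vllm_tps = 1434
vllm_tps = 1361

speedup = nano_vllm_tps / vllm_tps
print(f"Nano-vLLM throughput is {speedup:.3f}x vLLM's in this run")
```

A single-configuration result like this should not be generalized to other models, batch sizes, or GPUs.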

Maintenance & Community

No specific information on contributors, sponsorships, or community channels (Discord/Slack) is provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The project's license is not specified, which may pose a barrier to commercial adoption. The README also does not document hardware compatibility beyond the benchmarked RTX 4070 Laptop GPU, nor the supported CUDA versions.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 13
  • Issues (30d): 13

Star History

  • 5,638 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

LightLLM by ModelTC

Top 0.7% · 3k stars
Python framework for LLM inference and serving
created 2 years ago · updated 17 hours ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Tobi Lutke (cofounder of Shopify), and 27 more.

vllm by vllm-project

Top 1.0% · 54k stars
LLM serving engine for high-throughput, memory-efficient inference
created 2 years ago · updated 17 hours ago