tiny-vllm by jmaczan

High-performance LLM inference engine in C++/CUDA

Created 4 months ago
783 stars

Top 44.3% on SourcePulse

Summary

This project offers a C++ and CUDA-based LLM inference engine, designed as a smaller, educational alternative to vLLM. It targets engineers, researchers, and students aiming to build high-performance inference systems from scratch, providing deep insights into CUDA kernel engineering, memory management, and advanced inference techniques.

How It Works

Built entirely in C++ and CUDA, the engine prioritizes direct GPU computation. It implements the core components of LLM inference: model loading from Safetensors, prefill and decode, KV caching, and both static and continuous batching. Custom CUDA kernels cover embeddings, attention (GQA), RMSNorm, RoPE, and the feed-forward network; matrix multiplication goes through cuBLAS, and PagedAttention manages KV cache memory.
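To make the paged KV cache idea concrete, here is a minimal host-side sketch of the bookkeeping behind it: each sequence owns a block table that maps its logical token positions onto fixed-size physical blocks drawn from a shared free list. The struct and method names (PagedKvCache, append_token, and so on) are hypothetical and are not taken from the tiny-vllm source; the actual engine stores key/value tensors on the GPU, whereas this sketch only tracks slot indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: names and layout are assumptions, not tiny-vllm's API.
struct PagedKvCache {
    std::size_t block_size;                      // tokens per physical block
    std::vector<int> free_blocks;                // stack of unused physical block ids
    std::vector<std::vector<int>> block_tables;  // per sequence: logical block -> physical block
    std::vector<std::size_t> seq_lens;           // tokens stored per sequence

    PagedKvCache(std::size_t num_blocks, std::size_t tokens_per_block)
        : block_size(tokens_per_block) {
        assert(tokens_per_block > 0);
        // Push ids in reverse so that block 0 is handed out first.
        for (std::size_t i = num_blocks; i > 0; --i) {
            free_blocks.push_back(static_cast<int>(i - 1));
        }
    }

    // Register a new, empty sequence and return its id.
    int add_sequence() {
        block_tables.emplace_back();
        seq_lens.push_back(0);
        return static_cast<int>(block_tables.size() - 1);
    }

    // Reserve a slot for one more token of sequence `seq`.
    // Returns the flat slot index (physical_block * block_size + offset),
    // or -1 if a new block is needed and none is free.
    long append_token(int seq) {
        assert(seq >= 0 && static_cast<std::size_t>(seq) < block_tables.size());
        const std::size_t len = seq_lens[seq];
        const std::size_t offset = len % block_size;
        if (offset == 0) {  // no block yet, or the last block is full
            if (free_blocks.empty()) return -1;
            block_tables[seq].push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        seq_lens[seq] = len + 1;
        return static_cast<long>(block_tables[seq].back()) *
                   static_cast<long>(block_size) +
               static_cast<long>(offset);
    }

    // Return all blocks of a finished sequence to the free list.
    void free_sequence(int seq) {
        assert(seq >= 0 && static_cast<std::size_t>(seq) < block_tables.size());
        for (int b : block_tables[seq]) free_blocks.push_back(b);
        block_tables[seq].clear();
        seq_lens[seq] = 0;
    }
};
```

Because blocks are allocated one at a time as a sequence grows, two sequences of very different lengths can share the same pool without either reserving its maximum length up front. Freeing a finished sequence returns its blocks immediately, which is what lets a continuous batching scheduler admit a new request in the middle of a run.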

Quick Start & Requirements

  • Installation: Execute ./test.sh after setting up dependencies.
  • Prerequisites: NVIDIA GPU, Linux (tested with CUDA 13.1, C++17), nlohmann/json, and Llama 3.2 1B Instruct model weights (model.safetensors).
  • Documentation: The README serves as a comprehensive course.

Highlighted Details

  • Educational Depth: Detailed CUDA kernel implementations for LLM operations (embeddings, attention, RoPE, RMSNorm).
  • Low-Level CUDA Engineering: Direct GPU kernel development for performance.
  • Advanced Techniques: PagedAttention, continuous batching, online softmax.
  • cuBLAS Integration: Optimized matrix multiplication with explanation of transposition tricks.
  • Memory Management: GPU memory allocation, data transfer, Paged KV cache.
  • Precision Analysis: In-depth discussion of the FP16 and BF16 formats.
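Of the techniques listed above, online softmax is compact enough to sketch on the CPU. The version below is a plain C++ reference, not the project's CUDA kernel: it keeps a running maximum and a running sum in a single pass over the input, rescaling the sum whenever the maximum grows, and then normalizes. The function name online_softmax is hypothetical, and the sketch assumes finite inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference sketch of online softmax (assumes all inputs are finite).
std::vector<float> online_softmax(const std::vector<float>& x) {
    assert(!x.empty());
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;

    // Single pass: update the max and rescale the accumulated sum so that
    // it is always expressed relative to the current max.
    for (float v : x) {
        const float new_max = std::max(running_max, v);
        running_sum = running_sum * std::exp(running_max - new_max) +
                      std::exp(v - new_max);
        running_max = new_max;
    }

    // Normalization pass using the final max and sum.
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - running_max) / running_sum;
    }
    return out;
}
```

Subtracting the running maximum keeps every exponent at or below zero, so large logits such as 1000.0f do not overflow. The single-pass accumulation is what allows an attention kernel to process keys tile by tile without first scanning the whole row for its maximum.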

Maintenance & Community

Maintained by Jędrzej Maczan, with support via GitHub Issues. No dedicated community channels are listed.

Licensing & Compatibility

Licensed under Apache License 2.0, permitting commercial use with attribution.

Limitations & Caveats

An NVIDIA GPU is strictly required. The project has been tested only on specific Linux, CUDA, and GCC versions, so other environments may need adjustments. The low-level focus and extensive detail make for a steep learning curve. Some README sections are still marked as < TODO >. The primary example uses Llama 3.2 1B Instruct.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 656 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

2.0%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 12 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Pankaj Gupta (Cofounder of Baseten), and 1 more.

cccl by NVIDIA

0.6%
2k
CUDA C++ building blocks for high-performance GPU computing
Created 5 years ago
Updated 12 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jeff Hammerbacher (Cofounder of Cloudera), and 9 more.

FlashMLA by deepseek-ai

0.1%
13k
Efficient CUDA kernels for MLA decoding
Created 1 year ago
Updated 1 month ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 62 more.

vllm by vllm-project

0.5%
82k
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 12 hours ago