tiny-vllm by jmaczan

High-performance LLM inference engine in C++/CUDA

Created 5 months ago

942 stars

Top 38.1% on SourcePulse

Project Summary

This project is an LLM inference engine written in C++ and CUDA, designed as a smaller, educational alternative to vLLM. It targets engineers, researchers, and students who want to build a high-performance inference system from scratch, and it covers CUDA kernel engineering, GPU memory management, and inference techniques such as PagedAttention and continuous batching.

How It Works

The engine is written entirely in C++ and CUDA and performs its computation directly on the GPU. It implements the core components of LLM inference: model loading (Safetensors), prefill and decode, KV caching, and both static and continuous batching. Custom CUDA kernels handle embeddings, attention (GQA), RMSNorm, RoPE, and the feed-forward network; cuBLAS handles matrix multiplication, and PagedAttention manages KV cache memory.
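
To make the paged KV cache idea concrete, here is a minimal CPU-side sketch of the address translation such a cache relies on: a per-sequence block table maps logical token positions to fixed-size physical blocks in a shared pool. All names (PagedKvLayout, kv_offset, blocks_needed) and the memory layout are assumptions made for illustration; they are not taken from tiny-vllm's source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout description; not tiny-vllm's actual data structure.
struct PagedKvLayout {
    std::size_t block_size;    // tokens stored per physical block
    std::size_t token_stride;  // elements per token slot, e.g. num_kv_heads * head_dim
};

// Translate a logical token position into an element offset in the shared pool.
// block_table[i] is the physical block index that backs logical block i.
inline std::size_t kv_offset(const PagedKvLayout& layout,
                             const std::vector<std::size_t>& block_table,
                             std::size_t token_pos) {
    const std::size_t logical_block = token_pos / layout.block_size;
    const std::size_t slot = token_pos % layout.block_size;
    assert(logical_block < block_table.size());
    const std::size_t physical_block = block_table[logical_block];
    return (physical_block * layout.block_size + slot) * layout.token_stride;
}

// Number of blocks a sequence of num_tokens occupies (ceiling division).
inline std::size_t blocks_needed(const PagedKvLayout& layout,
                                 std::size_t num_tokens) {
    return (num_tokens + layout.block_size - 1) / layout.block_size;
}
```

Because a sequence only ever holds whole blocks, the physical blocks do not need to be contiguous, and a growing sequence can take any free block from the pool. An attention kernel would perform the same lookup per token on the GPU.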

Quick Start & Requirements

Installation: Execute ./test.sh after setting up dependencies.
Prerequisites: NVIDIA GPU, Linux (tested with CUDA 13.1, C++17), nlohmann/json, and Llama 3.2 1B Instruct model weights (model.safetensors).
Documentation: The README serves as a comprehensive course.

Highlighted Details

Educational Depth: Detailed CUDA kernel implementations for LLM operations (embeddings, attention, RoPE, RMSNorm).
Low-Level CUDA Engineering: Direct GPU kernel development for performance.
Advanced Techniques: PagedAttention, continuous batching, online softmax.
cuBLAS Integration: Optimized matrix multiplication with explanation of transposition tricks.
Memory Management: GPU memory allocation, data transfer, Paged KV cache.
Precision Analysis: In-depth discussion of the FP16 and BF16 formats.
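
Of the techniques listed above, online softmax is the easiest to show in a few lines. The sketch below is a CPU reference in C++, not the project's CUDA kernel: it keeps a running maximum and a running normalizer in a single pass, rescaling the normalizer whenever the maximum grows. The names SoftmaxState, online_softmax_state, and online_softmax are hypothetical, and the code assumes all inputs are finite (no -inf mask values).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running statistics for a numerically stable softmax (illustrative names).
struct SoftmaxState {
    float max = -std::numeric_limits<float>::infinity();
    float sum = 0.0f;  // sum of exp(x_i - max) over the elements seen so far
};

// One pass over x: update the max and rescale the accumulated sum to match it.
inline SoftmaxState online_softmax_state(const std::vector<float>& x) {
    SoftmaxState s;
    for (float v : x) {
        assert(std::isfinite(v));  // assumption: no -inf masking in this sketch
        const float new_max = std::fmax(s.max, v);
        s.sum = s.sum * std::exp(s.max - new_max) + std::exp(v - new_max);
        s.max = new_max;
    }
    return s;
}

// Second pass: normalize each element with the final max and sum.
inline std::vector<float> online_softmax(const std::vector<float>& x) {
    assert(!x.empty());
    const SoftmaxState s = online_softmax_state(x);
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - s.max) / s.sum;
    }
    return out;
}
```

The rescaling step is what lets an attention kernel process keys block by block without first scanning the whole row for its maximum, which is why the technique pairs naturally with a paged KV cache.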

Maintenance & Community

Maintained by Jędrzej Maczan, with support via GitHub Issues. No dedicated community channels are listed.

Licensing & Compatibility

Licensed under the Apache License 2.0, which permits commercial use provided the license text and copyright notices are retained.

Limitations & Caveats

An NVIDIA GPU is strictly required. The project has been tested only on specific Linux, CUDA, and GCC versions, so other environments may need adjustments. The learning curve is steep because of the low-level focus and the amount of detail. Some README sections are still marked as < TODO >. The primary example uses Llama 3.2 1B Instruct.

Health Check

Last Commit

3 weeks ago

Responsiveness

Inactive

Star History

131 stars in the last 30 days