jmaczan
High-performance LLM inference engine in C++/CUDA
Top 44.3% on SourcePulse
Summary
This project offers a C++ and CUDA-based LLM inference engine, designed as a smaller, educational alternative to vLLM. It targets engineers, researchers, and students aiming to build high-performance inference systems from scratch, providing deep insights into CUDA kernel engineering, memory management, and advanced inference techniques.
How It Works
Built entirely in C++ and CUDA, the engine prioritizes direct GPU computation. It implements core LLM inference components: model loading (Safetensors), prefill/decode, KV caching, and batching strategies (static and continuous). It includes custom CUDA kernels for embeddings, attention (GQA), RMSNorm, RoPE, and feed-forward networks, and it uses cuBLAS for matrix multiplication and PagedAttention for KV cache memory management.
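To make the PagedAttention idea concrete, here is a minimal host-side sketch in C++ of the bookkeeping behind a paged KV cache: the cache is split into fixed-size blocks, and each sequence keeps a block table that maps logical token positions to physical blocks. This is an illustration under stated assumptions, not the project's code; the class name `PagedKVCache` and its methods are hypothetical, and no GPU memory is actually allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: block-table bookkeeping for a paged KV cache.
// A real engine would pair each physical block index with GPU memory.
class PagedKVCache {
public:
    PagedKVCache(std::size_t num_blocks, std::size_t block_size)
        : block_size_(block_size) {
        if (block_size == 0) throw std::invalid_argument("block_size must be > 0");
        // All physical blocks start free; low indices are handed out first.
        for (std::size_t i = num_blocks; i > 0; --i) free_blocks_.push_back(i - 1);
    }

    // Reserve a slot for one new token of sequence `seq_id`.
    // Returns {physical block index, offset inside that block}.
    std::pair<std::size_t, std::size_t> append_token(int seq_id) {
        Sequence& seq = sequences_[seq_id];
        const std::size_t offset = seq.num_tokens % block_size_;
        if (offset == 0) {  // no block yet, or the last block is full
            if (free_blocks_.empty()) throw std::runtime_error("out of KV cache blocks");
            seq.block_table.push_back(free_blocks_.back());
            free_blocks_.pop_back();
        }
        ++seq.num_tokens;
        return {seq.block_table.back(), offset};
    }

    // Translate a logical token position into its physical slot.
    std::pair<std::size_t, std::size_t> locate(int seq_id, std::size_t pos) const {
        const Sequence& seq = sequences_.at(seq_id);
        if (pos >= seq.num_tokens) throw std::out_of_range("token position not cached");
        return {seq.block_table[pos / block_size_], pos % block_size_};
    }

    // Return all blocks of a finished sequence to the free list.
    void free_sequence(int seq_id) {
        auto it = sequences_.find(seq_id);
        if (it == sequences_.end()) return;
        for (std::size_t block : it->second.block_table) free_blocks_.push_back(block);
        sequences_.erase(it);
    }

    std::size_t free_block_count() const { return free_blocks_.size(); }

private:
    struct Sequence {
        std::vector<std::size_t> block_table;  // logical block -> physical block
        std::size_t num_tokens = 0;
    };

    std::size_t block_size_;
    std::vector<std::size_t> free_blocks_;
    std::unordered_map<int, Sequence> sequences_;
};
```

Because blocks are allocated on demand and returned when a sequence finishes, sequences of different lengths can share one pool without reserving the maximum context length up front, which is what makes this layout a natural fit for continuous batching.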
Quick Start & Requirements
Run ./test.sh after setting up dependencies.
Requirements include nlohmann/json and Llama 3.2 1B Instruct model weights (model.safetensors).
Highlighted Details
Maintenance & Community
Maintained by Jędrzej Maczan, with support via GitHub Issues. No dedicated community channels are listed.
Licensing & Compatibility
Licensed under Apache License 2.0, permitting commercial use with attribution.
Limitations & Caveats
Requires an NVIDIA GPU. Tested only on specific Linux, CUDA, and GCC versions, so other environments may need adjustments. The learning curve is steep because of the low-level focus and the amount of detail. Some README sections are still marked as < TODO >. The primary example uses Llama 3.2 1B Instruct.