LLM inference engine in C++/CUDA for educational performance engineering
This project provides a C++/CUDA implementation for Large Language Model (LLM) inference, designed as an educational tool for performance engineering. It targets developers and researchers interested in understanding and optimizing LLM execution from scratch, offering a dependency-minimal implementation for maximum control and insight.
How It Works
Yalm implements LLM inference using C++ and CUDA, minimizing external library dependencies to focus on core computational kernels. The design prioritizes scientific understanding of optimizations and code readability, enabling users to dissect performance bottlenecks. It leverages custom kernels for operations like matrix multiplication and attention, aiming for efficient execution on NVIDIA GPUs.
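To make the idea concrete, below is a minimal, hypothetical sketch of the kind of hand-written kernel such an engine relies on: a naive matrix-vector multiply, the core operation behind a transformer layer's linear projections. The kernel name, launch configuration, and float32 types are illustrative assumptions, not code from this repository.

// Hypothetical sketch of a hand-written CUDA kernel (not from the yalm codebase).
// Computes y = W * x for a row-major [rows x cols] weight matrix W and a vector x.
#include <cuda_runtime.h>

__global__ void matvec_naive(const float* W, const float* x, float* y,
                             int rows, int cols) {
  // One thread per output row; a tuned kernel would instead assign a warp
  // per row and reduce partial sums with __shfl_down_sync.
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= rows) return;
  float acc = 0.0f;
  for (int j = 0; j < cols; ++j) {
    acc += W[row * cols + j] * x[j];
  }
  y[row] = acc;
}

void launch_matvec(const float* d_W, const float* d_x, float* d_y,
                   int rows, int cols) {
  int block = 256;                          // illustrative block size
  int grid = (rows + block - 1) / block;    // cover all rows
  matvec_naive<<<grid, block>>>(d_W, d_x, d_y, rows, cols);
}

Profiling a simple baseline like this against cuBLAS or a fused variant is the kind of exercise the project is built to support.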
Quick Start & Requirements
Requires git-lfs and pip. Convert a model to the yalm format, then run inference:

python convert.py --dtype fp16 model.yalm
./build/main model.yalm -i "Your prompt"
Highlighted Details
Benchmarked against transformers and llama.cpp, achieving competitive throughput (63.8 tok/s FP16 on an RTX 4090).
Maintenance & Community
Inspired by calm and llama.cpp, with code adapted from llama2.c.
Licensing & Compatibility
Limitations & Caveats
Currently supports only completion tasks, not chat interfaces. Requires an NVIDIA GPU, and the model must fit entirely within VRAM. Tested models include Mistral-v0.2, Mixtral-v0.1 (CPU only), and Llama-3.2.