LLM inference engine in C++/CUDA for educational performance engineering
This project provides a C++/CUDA implementation for Large Language Model (LLM) inference, designed as an educational tool for performance engineering. It targets developers and researchers interested in understanding and optimizing LLM execution from scratch, offering a dependency-minimal implementation for maximum control and insight.
How It Works
Yalm implements LLM inference using C++ and CUDA, minimizing external library dependencies to focus on core computational kernels. The design prioritizes scientific understanding of optimizations and code readability, enabling users to dissect performance bottlenecks. It leverages custom kernels for operations like matrix multiplication and attention, aiming for efficient execution on NVIDIA GPUs.
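To make the idea concrete, below is a minimal, hypothetical sketch of the kind of hand-written kernel such an engine relies on: a naive matrix-vector multiply, the core operation behind a transformer layer's linear projections. The kernel name, launch configuration, and float32 types are illustrative assumptions, not code from this repository.

// Hypothetical sketch of a hand-written CUDA kernel (not from the yalm codebase).
// Computes y = W * x for a row-major [rows x cols] weight matrix W and a vector x.
#include <cuda_runtime.h>

__global__ void matvec_naive(const float* W, const float* x, float* y,
                             int rows, int cols) {
  // One thread per output row; a tuned kernel would instead assign a warp
  // per row and reduce partial sums with __shfl_down_sync.
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= rows) return;
  float acc = 0.0f;
  for (int j = 0; j < cols; ++j) {
    acc += W[row * cols + j] * x[j];
  }
  y[row] = acc;
}

void launch_matvec(const float* d_W, const float* d_x, float* d_y,
                   int rows, int cols) {
  int block = 256;                          // illustrative block size
  int grid = (rows + block - 1) / block;    // cover all rows
  matvec_naive<<<grid, block>>>(d_W, d_x, d_y, rows, cols);
}

Profiling a simple baseline like this against cuBLAS or a fused variant is the kind of exercise the project is built to support.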
Quick Start & Requirements
Requires git-lfs and pip. Convert a model to the yalm format, then run inference:

python convert.py --dtype fp16 model.yalm
./build/main model.yalm -i "Your prompt"
Highlighted Details
Benchmarked against transformers and llama.cpp, achieving competitive throughput (63.8 tok/s FP16 on an RTX 4090).
Maintenance & Community
Inspired by calm and llama.cpp, with code adapted from llama2.c.
Licensing & Compatibility
Limitations & Caveats
Currently supports only completion tasks, not chat interfaces. Requires an NVIDIA GPU, and the model must fit entirely within VRAM. Tested models include Mistral-v0.2, Mixtral-v0.1 (CPU only), and Llama-3.2.