yalm by andrewkchan

LLM inference engine in C++/CUDA for educational performance engineering

Created 11 months ago
493 stars

Top 62.7% on SourcePulse

Project Summary

This project provides a C++/CUDA implementation for Large Language Model (LLM) inference, designed as an educational tool for performance engineering. It targets developers and researchers interested in understanding and optimizing LLM execution from scratch, offering a library-free approach for maximum control and insight.

How It Works

Yalm implements LLM inference using C++ and CUDA, minimizing external library dependencies to focus on core computational kernels. The design prioritizes scientific understanding of optimizations and code readability, enabling users to dissect performance bottlenecks. It leverages custom kernels for operations like matrix multiplication and attention, aiming for efficient execution on NVIDIA GPUs.
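To give a flavor of the hand-written kernels described above, here is a minimal illustrative sketch (not yalm's actual code) of an FP16 matrix-vector multiply of the kind a from-scratch engine runs for every linear layer during single-token decoding; the kernel name, launch configuration, and warp-per-row strategy are assumptions for illustration only.

```cuda
#include <cuda_fp16.h>

// One warp computes one output row: each lane strides across the row,
// partial sums are accumulated in fp32 and reduced with warp shuffles.
__global__ void matvec_fp16(const half* __restrict__ W,  // [d_out, d_in], row-major
                            const half* __restrict__ x,  // [d_in]
                            float* __restrict__ y,       // [d_out]
                            int d_in, int d_out) {
  int row = blockIdx.x * blockDim.y + threadIdx.y;
  if (row >= d_out) return;
  float acc = 0.0f;
  for (int col = threadIdx.x; col < d_in; col += warpSize) {
    acc += __half2float(W[row * d_in + col]) * __half2float(x[col]);
  }
  // Warp-level tree reduction of the 32 partial sums.
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    acc += __shfl_down_sync(0xffffffff, acc, offset);
  }
  if (threadIdx.x == 0) y[row] = acc;
}

// Example launch: 4 warps per block, one warp per output row.
// dim3 block(32, 4);
// dim3 grid((d_out + block.y - 1) / block.y);
// matvec_fp16<<<grid, block>>>(W_dev, x_dev, y_dev, d_in, d_out);
```

A kernel like this is memory-bandwidth bound at batch size 1, which is why the project's performance work centers on dissecting and optimizing exactly these core loops rather than relying on vendor libraries.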

Quick Start & Requirements

  • Install: Requires a C++20-compatible compiler and the CUDA toolkit.
  • Prerequisites: LLM weights in Hugging Face format (e.g., Mistral-7B-Instruct-v0.2), git-lfs, pip.
  • Setup: Convert weights using python convert.py --dtype fp16 model.yalm.
  • Run: ./build/main model.yalm -i "Your prompt"
  • Docs: Blog post "Fast LLM Inference From Scratch" linked in README.

Highlighted Details

  • Benchmarked against transformers and llama.cpp, achieving competitive throughput (63.8 tok/s FP16 on RTX 4090).
  • Focuses on performance engineering and scientific understanding of optimizations.
  • Includes a test suite for kernel validation and benchmarking.
  • Supports CPU and CUDA backends, with CUDA requiring a single NVIDIA GPU.

Maintenance & Community

  • Inspired by calm and llama.cpp, with code adapted from llama2.c.
  • No explicit community channels or roadmap mentioned.

Licensing & Compatibility

  • No license specified in the README.
  • Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

Currently supports only completion, not a chat interface. The CUDA backend requires a single NVIDIA GPU with enough VRAM to hold the entire model. Tested models include Mistral-v0.2, Mixtral-v0.1 (CPU only), and Llama-3.2.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 68 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tim J. Baek (Founder of Open WebUI), and 7 more.

gemma.cpp by google

  • Top 0.1% on SourcePulse · 7k stars
  • C++ inference engine for Google's Gemma models
  • Created 1 year ago; updated 1 day ago