yalm by andrewkchan

LLM inference engine in C++/CUDA for educational performance engineering

created 10 months ago
396 stars

Top 74.0% on sourcepulse

View on GitHub
Project Summary

This project provides a C++/CUDA implementation for Large Language Model (LLM) inference, designed as an educational tool for performance engineering. It targets developers and researchers interested in understanding and optimizing LLM execution from scratch, offering a library-free approach for maximum control and insight.

How It Works

Yalm implements LLM inference using C++ and CUDA, minimizing external library dependencies to focus on core computational kernels. The design prioritizes scientific understanding of optimizations and code readability, enabling users to dissect performance bottlenecks. It leverages custom kernels for operations like matrix multiplication and attention, aiming for efficient execution on NVIDIA GPUs.
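
At decode time, nearly all of the work is matrix-vector multiplication of an activation vector against each weight matrix. As a purely illustrative sketch (not taken from yalm's source; the kernel name, launch parameters, and dimensions here are hypothetical), the warp-per-row FP16 GEMV pattern commonly used in library-free CUDA inference engines looks like this:

```cuda
// Illustrative warp-per-row FP16 matrix-vector multiply (GEMV), the core
// decode-time operation of single-batch LLM inference. A minimal sketch for
// illustration only, not yalm's actual kernel.
#include <cuda_fp16.h>
#include <cstdio>
#include <vector>

// Computes out = W * x, where W is a (d x n) row-major FP16 matrix and x is
// an n-element FP16 vector. One warp produces one output element; each lane
// strides across the row so loads from W are coalesced, and accumulation is
// done in FP32 for accuracy.
__global__ void matvec_fp16(const half* __restrict__ W,
                            const half* __restrict__ x,
                            float* __restrict__ out,
                            int n, int d) {
  int row = blockIdx.x * blockDim.y + threadIdx.y;  // one warp per output row
  if (row >= d) return;

  float acc = 0.0f;
  for (int j = threadIdx.x; j < n; j += warpSize) {
    acc += __half2float(W[(size_t)row * n + j]) * __half2float(x[j]);
  }
  // Warp-level tree reduction of the per-lane partial sums.
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    acc += __shfl_down_sync(0xffffffff, acc, offset);
  }
  if (threadIdx.x == 0) out[row] = acc;
}

int main() {
  const int n = 4096, d = 4096;  // hidden dimensions typical of a 7B model
  std::vector<half> hW((size_t)n * d), hx(n);
  for (size_t i = 0; i < hW.size(); ++i) hW[i] = __float2half(0.001f);
  for (int j = 0; j < n; ++j) hx[j] = __float2half(1.0f);

  half *W, *x;
  float *out;
  cudaMalloc((void**)&W, hW.size() * sizeof(half));
  cudaMalloc((void**)&x, hx.size() * sizeof(half));
  cudaMalloc((void**)&out, d * sizeof(float));
  cudaMemcpy(W, hW.data(), hW.size() * sizeof(half), cudaMemcpyHostToDevice);
  cudaMemcpy(x, hx.data(), hx.size() * sizeof(half), cudaMemcpyHostToDevice);

  dim3 block(32, 8);  // 8 warps per block, each warp handles one row
  dim3 grid((d + block.y - 1) / block.y);
  matvec_fp16<<<grid, block>>>(W, x, out, n, d);

  float result;
  cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
  printf("out[0] = %f (expect ~%f)\n", result, 0.001f * n);

  cudaFree(W); cudaFree(x); cudaFree(out);
  return 0;
}
```

Because decoding reads every weight once per generated token, kernels like this are memory-bandwidth bound, which is why the benchmarks below are best interpreted against the GPU's bandwidth ceiling.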

Quick Start & Requirements

  • Install: Requires a C++20-compatible compiler and the CUDA toolkit.
  • Prerequisites: LLM weights in Hugging Face format (e.g., Mistral-7B-Instruct-v0.2), git-lfs, pip.
  • Setup: Convert the downloaded Hugging Face weights to a .yalm file with convert.py, e.g. python convert.py --dtype fp16 model.yalm.
  • Run: ./build/main model.yalm -i "Your prompt"
  • Docs: The blog post "Fast LLM Inference From Scratch" is linked in the README.

Highlighted Details

  • Benchmarked against Hugging Face transformers and llama.cpp, achieving competitive throughput (63.8 tok/s in FP16 on an RTX 4090; see the back-of-the-envelope estimate after this list).
  • Focuses on performance engineering and scientific understanding of optimizations.
  • Includes a test suite for kernel validation and benchmarking.
  • Supports CPU and CUDA backends, with CUDA requiring a single NVIDIA GPU.
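
As a rough sanity check on that headline number (an illustrative estimate, assuming the benchmark model is Mistral-7B at roughly 7.2 billion parameters in FP16 and that single-batch decoding is bound by the RTX 4090's ~1008 GB/s memory bandwidth):

\[
\frac{1008\ \text{GB/s}}{7.2\times 10^{9}\ \text{params}\times 2\ \text{bytes/param}} \approx 70\ \text{tokens/s},
\]

so 63.8 tok/s would be on the order of 90% of the bandwidth-bound ceiling.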

Maintenance & Community

  • Inspired by calm and llama.cpp, with code adapted from llama2.c.
  • No explicit community channels or roadmap mentioned.

Licensing & Compatibility

  • No license specified in the README.
  • Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

Currently supports only completion tasks, not chat interfaces. The CUDA backend requires a single NVIDIA GPU, and the model must fit entirely within its VRAM. Tested models include Mistral-v0.2, Mixtral-v0.1 (CPU backend only), and Llama-3.2.
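
For scale, an illustrative VRAM estimate (assuming Mistral-7B at roughly 7.2 billion parameters in FP16, with 32 layers, 8 KV heads, head dimension 128, and a 4096-token context):

\[
\underbrace{7.2\times 10^{9}\times 2\ \text{B}}_{\text{weights}\ \approx\ 14.4\ \text{GB}}
\;+\;
\underbrace{4096\times 32\times 2\times 8\times 128\times 2\ \text{B}}_{\text{KV cache}\ \approx\ 0.5\ \text{GB}}
\;\approx\; 15\ \text{GB},
\]

which fits within the 24 GB of an RTX 4090.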

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 103 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Ying Sheng (author of SGLang).

fastllm by ztxz16 · 0.4% · 4k stars

High-performance C++ LLM inference library
created 2 years ago, updated 2 weeks ago
Starred by Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), Michael Han (cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai · 0.4% · 15k stars

Framework for LLM inference optimization experimentation
created 1 year ago, updated 2 days ago
Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA · 0.6% · 11k stars

LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago, updated 18 hours ago