llm.c  by karpathy

LLM training in pure C/CUDA, no PyTorch needed

Created 1 year ago
27,618 stars

Top 1.4% on SourcePulse

Project Summary

This repository provides LLM training and inference implemented in pure C and CUDA, aiming for simplicity and performance without heavy dependencies such as PyTorch or Python. It is aimed at developers and researchers who want to understand LLM internals, optimize performance, or port LLM functionality to environments where Python is impractical. The project focuses on reproducing GPT-2 and GPT-3 models and, for some training configurations, reports faster performance than PyTorch.

How It Works

The project uses raw C and CUDA for maximum control and performance, implementing core LLM components such as attention and feed-forward layers directly as C functions and CUDA kernels. This bypasses higher-level framework abstractions and allows fine-grained optimization; the project reports training throughput exceeding PyTorch Nightly. The codebase is structured to be educational, with well-documented kernels in the dev/cuda directory, ranging from simple to heavily optimized implementations.

Quick Start & Requirements

  • CPU Training: chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2 && OMP_NUM_THREADS=8 ./train_gpt2
  • GPU Training (FP32): chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2cu && ./train_gpt2cu
  • Prerequisites: CUDA Toolkit (for GPU), cuDNN (optional, for Flash Attention), MPI and NCCL (for multi-GPU/multi-node). Python is used for data preparation scripts.
  • Resources: Requires a CUDA-enabled GPU for GPU training. CPU training is possible but significantly slower.
  • Links: Quick Start Guide, Discussions, Discord (#llmc channel on Zero to Hero, #llmdotc on GPU MODE).
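Since the Quick Start list mentions MPI and NCCL for multi-GPU runs, a launch recipe can be sketched as below. This is a hedged, environment-dependent fragment: the process count is illustrative, and MPI/NCCL must already be installed as noted in the prerequisites.

```shell
# Hedged sketch of a multi-GPU launch via MPI + NCCL.
# Assumes the starter pack was downloaded and the CUDA binary built
# as in the single-GPU steps above; -np (process count, typically one
# per GPU) is illustrative.
make train_gpt2cu
mpirun -np 4 ./train_gpt2cu
```

Multi-node runs additionally depend on how NCCL is initialized (file system or TCP) and on the cluster scheduler configuration.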

Highlighted Details

  • Reports training up to 7% faster than PyTorch Nightly.
  • Includes a simple CPU fp32 reference implementation in ~1,000 lines of C.
  • Supports multi-GPU and multi-node training via MPI and NCCL (with file-system or TCP-based NCCL initialization).
  • Offers a tutorial on implementing a Layernorm layer in C.

Maintenance & Community

The project is actively maintained by Andrej Karpathy. Developer coordination occurs in GitHub Discussions and on Discord. A variety of community ports to different languages and hardware are listed as notable forks.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The cuDNN-accelerated Flash Attention path is new, disabled by default, and significantly increases compile times. Multi-node training setup can be complex depending on the environment (e.g., Slurm configuration for PMIx support). CPU training is primarily for demonstration due to performance limitations.

Health Check
Last Commit

2 months ago

Responsiveness

1 day

Pull Requests (30d)
3
Issues (30d)
1
Star History
253 stars in the last 30 days

Explore Similar Projects

Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8%
5k
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 4 days ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

0.3%
9k
PyTorch training helper for distributed execution
Created 4 years ago
Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 14 hours ago