llm.c  by karpathy

LLM training in pure C/CUDA, no PyTorch needed

Created 1 year ago
27,618 stars

Top 1.4% on SourcePulse

Project Summary

This repository provides LLM training and inference implemented in pure C and CUDA, aiming for simplicity and performance without heavy dependencies such as PyTorch or Python. It is aimed at developers and researchers who want to understand LLM internals, optimize performance, or port LLM functionality to environments where Python is impractical. The project focuses on reproducing GPT-2 and GPT-3 models and, for some training configurations, reports faster performance than PyTorch.

How It Works

The project uses raw C and CUDA for maximum control and performance, implementing core LLM components such as attention and feed-forward layers directly as C functions and CUDA kernels. This bypasses higher-level framework abstractions and allows fine-grained optimization; the project reports training throughput exceeding PyTorch Nightly. The codebase is structured to be educational, with well-documented kernels in the dev/cuda directory, ranging from simple to heavily optimized implementations.

Quick Start & Requirements

  • CPU Training: chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2 && OMP_NUM_THREADS=8 ./train_gpt2
  • GPU Training (FP32): chmod u+x ./dev/download_starter_pack.sh && ./dev/download_starter_pack.sh && make train_gpt2cu && ./train_gpt2cu
  • Prerequisites: CUDA Toolkit (for GPU), cuDNN (optional, for Flash Attention), MPI and NCCL (for multi-GPU/multi-node). Python is used for data preparation scripts.
  • Resources: Requires a CUDA-enabled GPU for GPU training. CPU training is possible but significantly slower.
  • Links: Quick Start Guide, Discussions, Discord (#llmc channel on Zero to Hero, #llmdotc on GPU MODE).
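Since the Quick Start list mentions MPI and NCCL for multi-GPU runs, a launch recipe can be sketched as below. This is a hedged, environment-dependent fragment: the process count is illustrative, and MPI/NCCL must already be installed as noted in the prerequisites.

```shell
# Hedged sketch of a multi-GPU launch via MPI + NCCL.
# Assumes the starter pack was downloaded and the CUDA binary built
# as in the single-GPU steps above; -np (process count, typically one
# per GPU) is illustrative.
make train_gpt2cu
mpirun -np 4 ./train_gpt2cu
```

Multi-node runs additionally depend on how NCCL is initialized (file system or TCP) and on the cluster scheduler configuration.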

Highlighted Details

  • Reports training up to 7% faster than PyTorch Nightly.
  • Includes a simple CPU fp32 reference implementation in ~1,000 lines of C.
  • Supports multi-GPU and multi-node training via MPI and NCCL (with file-system or TCP-based NCCL initialization).
  • Offers a tutorial on implementing a Layernorm layer in C.

Maintenance & Community

The project is actively maintained by Andrej Karpathy. Developer coordination occurs in GitHub Discussions and on Discord. A variety of community ports to different languages and hardware are listed as notable forks.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The cuDNN-accelerated Flash Attention path is new, disabled by default, and significantly increases compile times. Multi-node training setup can be complex depending on the environment (e.g., Slurm configuration for PMIx support). CPU training is primarily for demonstration due to performance limitations.

Health Check
Last Commit

2 months ago

Responsiveness

1 day

Pull Requests (30d)
3
Issues (30d)
1
Star History
253 stars in the last 30 days

Explore Similar Projects

Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode

0.8%
5k
Lecture series for GPU-accelerated computing
Created 1 year ago
Updated 4 days ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

0.3%
9k
PyTorch training helper for distributed execution
Created 4 years ago
Updated 1 day ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 14 hours ago