C/CUDA implementation of the Llama 3 model
This repository provides a pure C/CUDA implementation of the Llama 3 model, targeting researchers and developers seeking high-performance inference without complex dependencies. It offers a significant speedup over CPU-based implementations, enabling faster experimentation and deployment of Llama 3.
How It Works
The implementation uses pure C and CUDA for maximum performance with minimal dependencies. It builds on the foundational work of llama2.c for the model and tokenizer logic, and incorporates CUDA kernels from rogerallen and ankan-ban. The design favors a single-file, dependency-free structure for easy compilation and use, with careful attention to keeping floating-point error low enough to match the results of the reference NumPy implementation.
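Because the build is dependency-free, the matrix multiplications are presumably hand-written CUDA kernels rather than cuBLAS calls. The following is a minimal sketch of that style of GEMV kernel (one thread block per output row, with a shared-memory reduction); the kernel name, data layout, and launch configuration are illustrative assumptions, not the repository's exact code.

#include <cuda_runtime.h>

// Sketch: out = W @ x, where W is (d x n) row-major and x has n elements.
// One block computes one output row; threads stride across the row and the
// partial sums are combined with a shared-memory tree reduction.
__global__ void matmul_kernel(float *out, const float *x, const float *W,
                              int n, int d) {
    int row = blockIdx.x;
    if (row >= d) return;

    extern __shared__ float partial[];
    float sum = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        sum += W[row * n + j] * x[j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}

// Example launch: one block per output row, 256 threads per block.
// matmul_kernel<<<d, 256, 256 * sizeof(float)>>>(out, x, W, n, d);

Note that the reduction order in such a kernel differs from a sequential NumPy dot product, which is one reason matching the NumPy reference requires the floating-point care mentioned above.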
Quick Start & Requirements
make
./runcuda "I have a dream"
Highlighted Details
Maintenance & Community
The project is maintained by Sang Park. Planned work includes ROCm and oneAPI support and a correct Flash Attention implementation.
Licensing & Compatibility
Limitations & Caveats
The tokenizer implementation required "messy monkey patching" for compatibility and is slated for future cleanup. Multi-Head Attention is currently handled by a single kernel built from GEMV operations, which is somewhat inefficient compared to a batched GEMM; a rough sketch of the difference follows.
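As an illustration of that GEMV pattern, the sketch below scores one head at a time against the key cache during single-token decoding; the names, memory layout, and launch shape are assumptions for illustration, not the repository's actual kernel. A GEMM-based formulation would instead batch all heads into one larger multiplication, giving the GPU more work per launch and better data reuse.

// Sketch: per-head attention scores as head-by-head matrix-vector products
// over the key cache (the GEMV pattern noted above).
//   scores  : (n_heads, seq_len)
//   q       : (n_heads, head_dim)            current token's query
//   k_cache : (seq_len, n_heads, head_dim)   cached keys
__global__ void attn_scores_per_head(float *scores, const float *q,
                                     const float *k_cache,
                                     int n_heads, int head_dim, int seq_len) {
    int h = blockIdx.x;                             // head index
    int t = blockIdx.y * blockDim.x + threadIdx.x;  // one thread per cached timestep
    if (h >= n_heads || t >= seq_len) return;

    float dot = 0.0f;
    for (int i = 0; i < head_dim; i++)
        dot += q[h * head_dim + i] * k_cache[(t * n_heads + h) * head_dim + i];
    scores[h * seq_len + t] = dot / sqrtf((float)head_dim);
}

// Example launch: dim3 grid(n_heads, (seq_len + 255) / 256);
// attn_scores_per_head<<<grid, 256>>>(scores, q, k_cache, n_heads, head_dim, seq_len);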