llama3.cuda by likejazz

C/CUDA implementation for Llama 3 model

Created 1 year ago
348 stars

Top 79.9% on SourcePulse

Project Summary

This repository provides a pure C/CUDA implementation of the Llama 3 model, targeting researchers and developers seeking high-performance inference without complex dependencies. It offers a significant speedup over CPU-based implementations, enabling faster experimentation and deployment of Llama 3.

How It Works

The implementation uses pure C and CUDA for maximum performance and minimal dependencies. It builds on the foundational work of llama2.c for the model and tokenizer logic, and incorporates CUDA kernels from rogerallen and ankan-ban. The design prioritizes a single-file, dependency-free structure for ease of compilation and use, with careful attention to minimizing floating-point error so results match the NumPy reference implementation.

Quick Start & Requirements

  • Build: make
  • Requirements: NVIDIA GPU and the CUDA Toolkit (nvcc).
  • Demo: ./runcuda "I have a dream"

Highlighted Details

  • Achieves 2,823 tokens/s on an NVIDIA 4080 SUPER, an 85x speedup over NumPy on an M2 MacBook Air.
  • Single-file, dependency-free C implementation with Makefile and CMake support.
  • Matches the NumPy reference implementation with a floating-point error rate under 0.5%.
  • Includes a UTF-8 tokenizer implementation.

Maintenance & Community

The project is maintained by Sang Park. Further development is planned to include ROCm and oneAPI support, and to implement Flash Attention correctly.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

The tokenizer implementation required "messy monkey patching" for compatibility, with plans for future refinement. Multi-Head Attention is currently handled by a single kernel using GEMV operations, which is noted as somewhat inefficient compared to GEMM.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Eric Zhang (Founding Engineer at Modal), and 9 more.

DeepGEMM by deepseek-ai

Top 0.4% · 6k stars
CUDA library for efficient FP8 GEMM kernels with fine-grained scaling
Created 11 months ago
Updated 6 days ago
Starred by Tri Dao (Chief Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 23 more.

cutlass by NVIDIA

Top 0.5% · 9k stars
CUDA C++ and Python DSLs for high-performance linear algebra
Created 8 years ago
Updated 3 days ago