Deepdive-llama3-from-scratch by therealoliver

Llama3 inference walkthrough, step-by-step

Created 7 months ago
609 stars

Top 53.9% on SourcePulse

Project Summary

This repository provides a step-by-step, from-scratch implementation of the Llama 3 inference process, targeting engineers and researchers who want to deeply understand the model's mechanics. It offers detailed code annotations, dimension tracking, and principle explanations, including a dedicated section on KV-Cache, to facilitate a thorough grasp of Llama 3's architecture and operation.

How It Works

The project meticulously reconstructs Llama 3's inference pipeline, breaking down each component. It starts with tokenization and embedding, then details RMS normalization (RMSNorm), Rotary Position Embedding (RoPE) for positional information, and the multi-head attention mechanism with Grouped Query Attention (GQA). The implementation covers the Feed-Forward Network (FFN) with SwiGLU activation and residual connections, culminating in the final prediction layer. The approach emphasizes clarity through extensive inline comments and explicit dimension tracking at each step.
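As a rough illustration of two of these components (not the repository's exact code), the PyTorch sketch below implements RMSNorm and the complex-number formulation of RoPE; the RoPE base of 500000.0 matches Llama 3's published configuration, while function names and tensor shapes are illustrative assumptions.

```python
# Minimal sketches of RMSNorm and RoPE as described above.
# Shapes and names are illustrative, not the repository's exact code.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Normalize by the root-mean-square of each token's activations,
    # then rescale with a learned per-dimension weight (no mean subtraction).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(x: torch.Tensor, base: float = 500000.0) -> torch.Tensor:
    # x: [seq_len, n_heads, head_dim]. RoPE rotates each consecutive pair of
    # dimensions by a position-dependent angle, encoding position as phase.
    seq_len, _, head_dim = x.shape
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # [seq_len, head_dim/2]
    rot = torch.polar(torch.ones_like(angles), angles)          # unit complex rotations
    x_c = torch.view_as_complex(x.float().reshape(seq_len, -1, head_dim // 2, 2))
    return torch.view_as_real(x_c * rot[:, None, :]).reshape_as(x)
```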

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python 3.8+, PyTorch, Transformers, tiktoken, regex, matplotlib. Requires downloading Llama 3 8B model weights from Meta.
  • Setup: Download Llama 3 8B weights to the Meta-Llama-3-8B/original/ directory (a loading sketch follows this list).
  • Run: Execute Python scripts for each step (e.g., load_model.py, attention.py).
  • Docs: Project Repository
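
For orientation, here is a minimal loading sketch assuming the download follows Meta's standard release layout (a params.json config and a consolidated.00.pth checkpoint inside Meta-Llama-3-8B/original/); these file names describe that layout, not this repository's own scripts.

```python
# Hypothetical loading sketch; assumes Meta's standard release layout
# (params.json and consolidated.00.pth under Meta-Llama-3-8B/original/).
import json
import torch

model_dir = "Meta-Llama-3-8B/original"

with open(f"{model_dir}/params.json") as f:
    config = json.load(f)  # hyperparameters: dim, n_layers, n_heads, ...

# Load the checkpoint as a flat dict of parameter tensors on CPU.
weights = torch.load(f"{model_dir}/consolidated.00.pth", map_location="cpu")

print(config)
print(sorted(weights.keys())[:5])  # inspect a few parameter tensor names
```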

Highlighted Details

  • Step-by-step implementation of Llama 3's Transformer blocks.
  • Detailed explanation and implementation of Rotary Position Embedding (RoPE).
  • In-depth breakdown of Grouped Query Attention (GQA) and KV-Cache (see the sketch after this list).
  • Bilingual (Chinese/English) documentation and code comments.
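
To make the GQA and KV-Cache bullets concrete, here is a schematic single-token decode step; the head counts (32 query heads sharing 8 KV heads) match Llama 3 8B, but the cache handling is a generic illustration rather than the repository's implementation.

```python
# Schematic GQA decode step with a KV-cache. Head counts match Llama 3 8B;
# the cache itself is a generic illustration, not this repo's exact code.
import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, head_dim = 32, 8, 128
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

def gqa_decode_step(q, k_new, v_new, k_cache, v_cache):
    # q: [n_q_heads, 1, head_dim]; k_new/v_new: [n_kv_heads, 1, head_dim]
    # Append this token's keys/values instead of recomputing past ones.
    k_cache = torch.cat([k_cache, k_new], dim=1)  # [n_kv_heads, t, head_dim]
    v_cache = torch.cat([v_cache, v_new], dim=1)
    # Expand KV heads so each group of query heads reads its shared KV head.
    k = k_cache.repeat_interleave(group, dim=0)   # [n_q_heads, t, head_dim]
    v = v_cache.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / head_dim**0.5  # [n_q_heads, 1, t]
    out = F.softmax(scores, dim=-1) @ v               # [n_q_heads, 1, head_dim]
    return out, k_cache, v_cache

# One decode step starting from an empty cache.
k_cache = torch.empty(n_kv_heads, 0, head_dim)
v_cache = torch.empty(n_kv_heads, 0, head_dim)
q = torch.randn(n_q_heads, 1, head_dim)
kv = torch.randn(n_kv_heads, 1, head_dim)
out, k_cache, v_cache = gqa_decode_step(q, kv, kv.clone(), k_cache, v_cache)
```

Because past keys and values are reused from the cache, each new token attends with a single query row instead of recomputing attention over the whole sequence, which is the core saving the KV-Cache section explains.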

Maintenance & Community

The project is based on naklecha/llama3-from-scratch and significantly extends it. It appears to be a personal effort building on the original author's work, and no community interaction channels are explicitly mentioned.

Licensing & Compatibility

The repository is licensed under the MIT License. This license is permissive and allows for commercial use and integration into closed-source projects.

Limitations & Caveats

This project focuses solely on inference and does not include training code. It requires manual download of model weights, and the code is structured for educational purposes rather than production deployment.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
  Speculative decoding research paper for faster LLM inference
  10.6% · 2k stars · Created 1 year ago · Updated 1 week ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 15 more.

codellama by meta-llama
  Inference code for CodeLlama models
  0.0% · 16k stars · Created 2 years ago · Updated 1 year ago

Starred by Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), Zhiqiang Xie (Coauthor of SGLang), and 40 more.

llama by meta-llama
  Inference code for Llama 2 models (deprecated)
  0.1% · 59k stars · Created 2 years ago · Updated 7 months ago