llama-nuts-and-bolts by adalkiran

Llama 3.1 inference implementation, educational resource

Created 1 year ago
314 stars

Top 85.9% on SourcePulse

View on GitHub
Project Summary

This project provides a deep dive into the practical implementation of the Llama 3.1 8B-Instruct model, targeting engineers and researchers seeking to understand LLM internals beyond theoretical concepts. It offers a complete, dependency-free reimplementation of the model's inference pipeline in Go, enabling a granular understanding of each component.

How It Works

The project meticulously reconstructs Llama 3.1's architecture and inference process from the ground up, avoiding external libraries. It implements core functionalities like BFloat16 data types, memory mapping, tokenization, tensor operations, and rotary positional embeddings entirely in Go. Parallelization via goroutines is used to leverage CPU cores for computations, eschewing GPGPU or SIMD acceleration for educational clarity.
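To make the BFloat16 point concrete, here is a minimal sketch of the standard float32-to-bfloat16 conversion (keep the high 16 bits, rounding the discarded half to nearest even). The function names are hypothetical; this is an illustration of the technique, not the project's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// float32ToBFloat16 converts a float32 to bfloat16 (stored in a uint16) by
// keeping the high 16 bits, with round-to-nearest-even on the discarded half.
func float32ToBFloat16(f float32) uint16 {
	bits := math.Float32bits(f)
	// Add 0x7FFF plus the parity of the future LSB to round to nearest even.
	rounded := bits + 0x7FFF + ((bits >> 16) & 1)
	return uint16(rounded >> 16)
}

// bfloat16ToFloat32 widens a bfloat16 back to float32: the 16 stored bits
// become the sign, exponent, and top 7 mantissa bits of the float32.
func bfloat16ToFloat32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	for _, f := range []float32{1.0, 3.14159, -0.0001} {
		b := float32ToBFloat16(f)
		fmt.Printf("%g -> 0x%04X -> %g\n", f, b, bfloat16ToFloat32(b))
	}
}
```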

Quick Start & Requirements

  • Installation: Clone the repository and build the Go executable (go build -o llama-nb cmd/main.go) or use Docker (docker-compose up -d).
  • Prerequisites: Go toolchain, wget, md5sum.
  • Model Download: Requires downloading official Llama 3.1 8B-Instruct model files (~16GB) from Meta, following instructions provided in the README.
  • Documentation: GitHub Pages

Highlighted Details

  • Implements Llama 3.1 8B-Instruct inference entirely in Go, without Python or external ML libraries.
  • Covers BFloat16 implementation, memory mapping, RoPE, and custom tensor operations.
  • Supports CPU-based inference with parallelization via goroutines (see the sketch after this list).
  • Provides a CLI for predefined or custom prompts with streaming output.
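As an illustration of the goroutine-based CPU parallelization mentioned above, the sketch below splits the rows of a matrix-vector multiply across goroutines, one per CPU-sized chunk. It is a simplified stand-in under assumed data layout (row-major matrix), not the project's tensor implementation:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matVec multiplies an m×n row-major matrix by a vector of length n,
// distributing row ranges across goroutines, roughly one per CPU core.
func matVec(mat, vec []float32, m, n int) []float32 {
	out := make([]float32, m)
	workers := runtime.NumCPU()
	chunk := (m + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < m; start += chunk {
		end := start + chunk
		if end > m {
			end = m
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes a disjoint slice of out, so no locking is needed.
			for i := start; i < end; i++ {
				var sum float32
				row := mat[i*n : (i+1)*n]
				for j, x := range vec {
					sum += row[j] * x
				}
				out[i] = sum
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	mat := []float32{1, 2, 3, 4, 5, 6} // 2×3 matrix
	vec := []float32{1, 1, 1}
	fmt.Println(matVec(mat, vec, 2, 3)) // [6 15]
}
```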

Maintenance & Community

The project is maintained by adalkiran. Further community engagement details are not specified in the README.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This license is permissive and generally compatible with commercial and closed-source applications.

Limitations & Caveats

The project is explicitly for educational purposes and has not been tested for production or commercial use. It lacks GPGPU/SIMD support and does not implement sampling techniques such as top-k sampling or temperature scaling, instead always emitting the highest-probability token at each step (greedy decoding). Functionality is tailored specifically to the Llama 3.1 8B-Instruct model.
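Greedy decoding of this kind reduces to an argmax over the output logits. A minimal sketch (the function name is hypothetical, not the project's API):

```go
package main

import "fmt"

// pickGreedy returns the index of the largest logit, i.e. greedy decoding
// with no top-k filtering and no temperature scaling.
func pickGreedy(logits []float32) int {
	best := 0
	for i := 1; i < len(logits); i++ {
		if logits[i] > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{0.1, 2.3, -0.5, 1.9}
	fmt.Println(pickGreedy(logits)) // 1
}
```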

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

0.5% · 4k stars
Python framework for LLM inference and serving
Created 2 years ago · Updated 14 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4% · 8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 1 week ago
Starred by Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), Zhiqiang Xie (Coauthor of SGLang), and 40 more.

llama by meta-llama

0.1% · 59k stars
Inference code for Llama 2 models (deprecated)
Created 2 years ago · Updated 7 months ago