llama-nuts-and-bolts by adalkiran

Llama 3.1 inference implementation, educational resource

Created 1 year ago
314 stars

Top 85.9% on SourcePulse

View on GitHub
Project Summary

This project provides a deep dive into the practical implementation of the Llama 3.1 8B-Instruct model, targeting engineers and researchers seeking to understand LLM internals beyond theoretical concepts. It offers a complete, dependency-free reimplementation of the model's inference pipeline in Go, enabling a granular understanding of each component.

How It Works

The project meticulously reconstructs Llama 3.1's architecture and inference process from the ground up, avoiding external libraries. It implements core functionalities like BFloat16 data types, memory mapping, tokenization, tensor operations, and rotary positional embeddings entirely in Go. Parallelization via goroutines is used to leverage CPU cores for computations, eschewing GPGPU or SIMD acceleration for educational clarity.
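To make the BFloat16 point concrete, here is a minimal sketch of the standard float32-to-bfloat16 conversion (keep the high 16 bits, rounding the discarded half to nearest even). The function names are hypothetical; this is an illustration of the technique, not the project's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// float32ToBFloat16 converts a float32 to bfloat16 (stored in a uint16) by
// keeping the high 16 bits, with round-to-nearest-even on the discarded half.
func float32ToBFloat16(f float32) uint16 {
	bits := math.Float32bits(f)
	// Add 0x7FFF plus the parity of the future LSB to round to nearest even.
	rounded := bits + 0x7FFF + ((bits >> 16) & 1)
	return uint16(rounded >> 16)
}

// bfloat16ToFloat32 widens a bfloat16 back to float32: the 16 stored bits
// become the sign, exponent, and top 7 mantissa bits of the float32.
func bfloat16ToFloat32(b uint16) float32 {
	return math.Float32frombits(uint32(b) << 16)
}

func main() {
	for _, f := range []float32{1.0, 3.14159, -0.0001} {
		b := float32ToBFloat16(f)
		fmt.Printf("%g -> 0x%04X -> %g\n", f, b, bfloat16ToFloat32(b))
	}
}
```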

Quick Start & Requirements

  • Installation: Clone the repository and build the Go executable (go build -o llama-nb cmd/main.go) or use Docker (docker-compose up -d).
  • Prerequisites: Go toolchain, wget, md5sum.
  • Model Download: Requires downloading official Llama 3.1 8B-Instruct model files (~16GB) from Meta, following instructions provided in the README.
  • Documentation: GitHub Pages

Highlighted Details

  • Implements Llama 3.1 8B-Instruct inference entirely in Go, without Python or external ML libraries.
  • Covers BFloat16 implementation, memory mapping, RoPE, and custom tensor operations.
  • Supports CPU-based inference with parallelization via goroutines (see the sketch after this list).
  • Provides a CLI for predefined or custom prompts with streaming output.
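As an illustration of the goroutine-based CPU parallelization mentioned above, the sketch below splits the rows of a matrix-vector multiply across goroutines, one per CPU-sized chunk. It is a simplified stand-in under assumed data layout (row-major matrix), not the project's tensor implementation:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matVec multiplies an m×n row-major matrix by a vector of length n,
// distributing row ranges across goroutines, roughly one per CPU core.
func matVec(mat, vec []float32, m, n int) []float32 {
	out := make([]float32, m)
	workers := runtime.NumCPU()
	chunk := (m + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < m; start += chunk {
		end := start + chunk
		if end > m {
			end = m
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			// Each goroutine writes a disjoint slice of out, so no locking is needed.
			for i := start; i < end; i++ {
				var sum float32
				row := mat[i*n : (i+1)*n]
				for j, x := range vec {
					sum += row[j] * x
				}
				out[i] = sum
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	mat := []float32{1, 2, 3, 4, 5, 6} // 2×3 matrix
	vec := []float32{1, 1, 1}
	fmt.Println(matVec(mat, vec, 2, 3)) // [6 15]
}
```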

Maintenance & Community

The project is maintained by adalkiran. Further community engagement details are not specified in the README.

Licensing & Compatibility

Licensed under the Apache License, Version 2.0. This license is permissive and generally compatible with commercial and closed-source applications.

Limitations & Caveats

The project is explicitly for educational purposes and has not been tested for production or commercial use. It lacks GPGPU/SIMD support and does not implement sampling techniques such as top-k sampling or temperature scaling, instead always emitting the highest-probability token at each step (greedy decoding). Functionality is tailored specifically to the Llama 3.1 8B-Instruct model.
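Greedy decoding of this kind reduces to an argmax over the output logits. A minimal sketch (the function name is hypothetical, not the project's API):

```go
package main

import "fmt"

// pickGreedy returns the index of the largest logit, i.e. greedy decoding
// with no top-k filtering and no temperature scaling.
func pickGreedy(logits []float32) int {
	best := 0
	for i := 1; i < len(logits); i++ {
		if logits[i] > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float32{0.1, 2.3, -0.5, 1.9}
	fmt.Println(pickGreedy(logits)) // 1
}
```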

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

0.5% · 4k stars
Python framework for LLM inference and serving
Created 2 years ago · Updated 14 hours ago
Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4% · 8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 1 week ago
Starred by Roy Frostig (Coauthor of JAX; Research Scientist at Google DeepMind), Zhiqiang Xie (Coauthor of SGLang), and 40 more.

llama by meta-llama

0.1% · 59k stars
Inference code for Llama 2 models (deprecated)
Created 2 years ago · Updated 7 months ago