tiny-universe  by datawhalechina

"White-box" guide for building LLMs from scratch

created 1 year ago
3,420 stars

Top 14.5% on sourcepulse

Project Summary

This project is a hands-on guide to building a Large Language Model (LLM) ecosystem from the ground up. Aimed at readers with a traditional deep learning background, it takes a "white-box" approach to understanding and reimplementing the main parts of that ecosystem, including LLM architecture and training, diffusion models, RAG frameworks, agent systems, and evaluation metrics, so that users can build their own "Tiny LLM Universe."

How It Works

The project follows a "purely hand-crafted" methodology that avoids high-level APIs and toolkits. It breaks complex LLM concepts into small, code-first implementations built up from first principles and basic PyTorch layers. Each technical step is documented with code and explanation, with the aim of demystifying LLM internals and making them easier to understand and customize.
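
To give a sense of what "hand-crafted" can mean in practice, here is a minimal sketch of causal scaled dot-product attention, the core operation of a Transformer layer. It is an illustration only and is not taken from the project: it assumes plain Python lists instead of PyTorch tensors, and the function names `softmax` and `attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over lists of vectors.

    queries and keys are lists of length-d vectors; values is a list of
    vectors with the same length as keys. With causal=True, position i
    attends only to positions 0..i (assumes self-attention, where queries
    and keys have the same length).
    """
    d = len(keys[0])
    dim_v = len(values[0])
    out = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)
        # Similarity of the query to each visible key, scaled by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys[:limit]
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values[:limit]))
            for j in range(dim_v)
        ])
    return out


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0]]
    print(attention(x, x, x))
```

Writing the loop out by hand makes the causal mask explicit: the first position can only see itself, so its output equals its own value vector.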

Highlighted Details

  • Full LLM Lifecycle: Covers Model, RAG, Agent, and Evaluation from scratch.
  • "White-Box" Design: Code is intentionally simplified so that beginner developers can follow the internals.
  • Component Modules: Includes TinyDiffusion (image generation), Qwen-Blog (LLM architecture), TinyRAG (retrieval-augmented generation), TinyAgent (minimal agent system), TinyEval (evaluation metrics), TinyLLM (basic LLM training), and TinyTransformer (Transformer architecture).
  • Low Resource Footprint: Modules like TinyLLM and TinyLlama3 are designed to run on minimal hardware (e.g., 2GB VRAM).
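
As an illustration of the kind of component listed above, the retrieval step of a retrieval-augmented generation pipeline can be sketched in a few lines. This is not the project's TinyRAG code. It assumes a toy bag-of-words "embedding" in place of a learned embedding model, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, documents, top_k=1):
    """Prepend the retrieved context to the question before calling an LLM."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    docs = [
        "Transformers use attention layers.",
        "Diffusion models generate images from noise.",
        "Agents call tools in a loop.",
    ]
    print(build_prompt("How do diffusion models generate images?", docs))
```

A real pipeline would replace `embed` with a neural embedding model and pass the resulting prompt to a language model. The ranking-then-prompting structure stays the same.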

Maintenance & Community

  • Contributors: Led by Xiao Hongru, Song Zhixue, and Zou Yuheng, with contributions from Liu Xiaoyu and others.
  • Community: Encourages contributions via issues and participation in the Datawhale learning community (thousands of members across various AI fields).
  • Links: WeChat Official Account: Datawhale.

Licensing & Compatibility

  • License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
  • Compatibility: The NonCommercial clause prohibits commercial use, and the ShareAlike clause requires derivative works to be distributed under the same license, so use in commercial or closed-source products requires explicit permission or alternative licensing.

Limitations & Caveats

The CC BY-NC-SA 4.0 license rules out commercial use without separate permission. The code is also written for clarity rather than production: users accustomed to high-level frameworks may need substantial effort to adapt the "purely hand-crafted" implementations to production environments.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 662 stars in the last 90 days

Explore Similar Projects

Starred by Peter Norvig (Author of Artificial Intelligence: A Modern Approach; Research Director at Google), Bojan Tunguz (AI Scientist; Formerly at NVIDIA), and 4 more.

LLMs-from-scratch by rasbt

Educational resource for LLM construction in PyTorch

created 2 years ago
updated 1 day ago
61k stars

Top 1.4% on sourcepulse