This project provides a comprehensive, hands-on guide to building a Large Language Model (LLM) ecosystem from the ground up. Targeting individuals with a traditional deep learning background, it offers a "white-box" approach to understanding and replicating core LLM components, including diffusion models, RAG frameworks, agent systems, and evaluation metrics, enabling users to build their own "Tiny LLM Universe."
How It Works
The project emphasizes a "purely hand-crafted" methodology, eschewing high-level APIs and toolkits. It breaks complex LLM concepts down into manageable, code-first implementations, starting from fundamental principles and basic PyTorch layers. By detailing each technical step with accompanying code and explanations, it aims to demystify LLM internals and enable deep understanding and customization.
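To give a flavor of this code-first, from-scratch style (this sketch is ours, not code from the repository), here is a single scaled dot-product attention head written in plain NumPy, the building block a hand-crafted Transformer starts from:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_q, seq_k) similarity matrix
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, head dimension 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))   # one value vector per key
out, w = attention(Q, K, V)   # out has shape (4, 8)
```

Stacking such heads, adding learned projections, and wrapping them in PyTorch `nn.Module` layers is the kind of progression the modules walk through.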
Quick Start & Requirements
- Install/Run: Primarily Python-based; individual components can be run via Docker or by executing the code directly.
- Prerequisites: Python, PyTorch, and NumPy. Some modules state concrete budgets: TinyLlama3 targets roughly 2 GB of VRAM for training and inference, and TinyDiffusion reports pre-training completing in about 2 hours.
- Resources: Minimal GPU requirements (around 2 GB of VRAM) are highlighted for certain modules.
- Links:
Highlighted Details
- Full LLM Lifecycle: Covers Model, RAG, Agent, and Evaluation from scratch.
- "White-Box" Design: Code is intentionally simplified so that beginner developers can understand the internals.
- Component Modules: Includes TinyDiffusion (image generation), Qwen-Blog (LLM architecture), TinyRAG (retrieval-augmented generation), TinyAgent (minimal agent system), TinyEval (evaluation metrics), TinyLLM (basic LLM training), and TinyTransformer (Transformer architecture).
- Low Resource Footprint: Modules like TinyLLM and TinyLlama3 are designed to run on minimal hardware (e.g., 2GB VRAM).
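The retrieval step at the core of a module like TinyRAG can be sketched in a few lines. This is our own illustrative toy (the function names and bag-of-words "embedding" are assumptions, not the project's API); a real pipeline would swap in a learned sentence encoder:

```python
import numpy as np

def embed(text, vocab):
    # Toy bag-of-words "embedding": count vocabulary words in the text.
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def retrieve(query, docs, top_k=1):
    # Build a vocabulary from the document collection.
    vocab = {w: i for i, w in enumerate(
        sorted({w for d in docs for w in d.lower().split()}))}
    doc_vecs = np.array([embed(d, vocab) for d in docs])
    q = embed(query, vocab)
    # Cosine similarity between the query and every document vector.
    sims = doc_vecs @ q / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)[:top_k]
    return [docs[i] for i in order]

docs = ["attention computes weighted sums of values",
        "diffusion models denoise images step by step",
        "agents call tools to act in an environment"]
best = retrieve("how does attention weight values", docs)
```

The retrieved passages would then be concatenated into the LLM's prompt, which is the "augmented generation" half of RAG.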
Maintenance & Community
- Contributors: Led by Xiao Hongru, Song Zhixue, and Zou Yuheng, with contributions from Liu Xiaoyu and others.
- Community: Encourages contributions via issues and participation in the Datawhale learning community (thousands of members across various AI fields).
- Links: WeChat Official Account: Datawhale.
Licensing & Compatibility
- License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
- Compatibility: The "Non-Commercial" clause restricts usage in commercial products or closed-source applications without explicit permission or alternative licensing.
Limitations & Caveats
The CC BY-NC-SA 4.0 license imposes significant restrictions on commercial use. And while the project aims for clarity, its "purely hand-crafted" nature may mean substantial adaptation effort for users accustomed to high-level frameworks who want to move the code toward production environments.