AIInfra by Infrasys-AI

AI infrastructure course for large model systems design

created 1 year ago
3,716 stars

Top 13.3% on sourcepulse

Project Summary

This repository provides a comprehensive, open-source curriculum on AI infrastructure, focusing on the hardware and software stack required for large language model (LLM) training and inference. It targets advanced undergraduates, graduate students, and AI system practitioners who want to understand the end-to-end lifecycle of AI systems, and it offers a structured learning path covering everything from AI chip architecture to distributed training and cutting-edge LLM algorithms.

How It Works

The curriculum is structured into modules covering AI chip principles, communication and storage, AI clusters, LLM training, LLM inference, LLM algorithms, and hot technical topics. It delves into hardware architectures (GPUs, TPUs, NPUs), distributed systems concepts (parallelism, communication libraries like NCCL), LLM-specific techniques (Transformer variants, quantization, efficient inference), and emerging trends like AI agents. The approach emphasizes a full-stack perspective, bridging hardware capabilities with software frameworks and algorithmic advancements.
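
To give a flavor of the communication primitives the distributed-training modules cover, the snippet below is a minimal sketch (not taken from the course materials) of a data-parallel gradient all-reduce using torch.distributed over the NCCL backend; the tensor shape, launch setup, and one-GPU-per-rank assumption are illustrative choices, not details from the course.

    # Minimal illustrative sketch: a data-parallel gradient all-reduce using
    # torch.distributed with the NCCL backend. Assumes launch via torchrun,
    # which sets RANK, WORLD_SIZE, and LOCAL_RANK, and one CUDA device per rank.
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        rank = dist.get_rank()

        # Stand-in for a local gradient produced by backward() on this rank's data shard.
        grad = torch.full((4,), float(rank), device="cuda")

        # Sum across ranks, then average: the core synchronization step of
        # data parallelism (DP), executed by NCCL under the hood.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        grad /= dist.get_world_size()

        print(f"rank {rank}: {grad.tolist()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun --nproc_per_node=<num_gpus>, every rank prints the same averaged vector, which is the invariant data-parallel training relies on.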

Quick Start & Requirements

  • Content Access: Because the repository exceeds 10 GB, the course materials are distributed via the Releases page; a direct git clone is discouraged because it can be very slow (one way to script the download is sketched after this list).
  • Prerequisites: No software installation is required to view the curriculum content (PPT slides), but following the material assumes foundational knowledge of computer architecture, operating systems, and deep learning.
  • Resources: Budget enough disk space for the downloaded materials; the full repository exceeds 10 GB.
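
For readers who prefer to script the download, here is a hedged sketch that queries GitHub's public Releases API for the latest release and prints its downloadable assets. The repository path Infrasys-AI/AIInfra is inferred from the project name, and the actual release and asset layout may differ.

    # Hedged sketch: list the assets of the latest GitHub release so individual
    # files can be downloaded instead of cloning the 10 GB+ repository.
    # Uses the public GitHub REST API; unauthenticated requests are rate-limited.
    import json
    import urllib.request

    API = "https://api.github.com/repos/Infrasys-AI/AIInfra/releases/latest"

    with urllib.request.urlopen(API) as resp:
        release = json.loads(resp.read())

    print("Release:", release.get("tag_name"))
    for asset in release.get("assets", []):
        size_mb = asset["size"] / 1e6
        print(f"{asset['name']:50s} {size_mb:8.1f} MB  {asset['browser_download_url']}")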

Highlighted Details

  • Comprehensive coverage of AI hardware, including NVIDIA GPUs, TPUs, NPUs, and domestic AI processors.
  • In-depth exploration of distributed training techniques (tensor, pipeline, expert, sequence, and data parallelism: TP, PP, EP, SP, DP) and communication primitives.
  • Analysis of LLM inference optimization frameworks (vLLM, SGLang) and techniques such as quantization (sketched after this list) and long-sequence handling.
  • Regular updates on the latest LLM algorithms and hot topics, including specific model deep dives (e.g., DeepSeek, Llama).
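
To make the quantization bullet concrete, below is a minimal illustrative sketch (not from the course materials) of symmetric per-channel INT8 weight quantization in NumPy, the basic idea behind many of the inference-optimization schemes the curriculum surveys.

    # Illustrative sketch: symmetric per-channel INT8 weight quantization.
    # Each output channel gets one scale so that its largest |w| maps to 127.
    import numpy as np

    def quantize_per_channel(w: np.ndarray):
        """Quantize a [out_features, in_features] float32 weight matrix to int8."""
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale = np.where(scale == 0.0, 1.0, scale)      # guard all-zero rows
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 8).astype(np.float32)
    q, scale = quantize_per_channel(w)
    print("max abs reconstruction error:", float(np.abs(w - dequantize(q, scale)).max()))

Production inference stacks combine weight rounding like this with calibrated activation handling and fused low-precision kernels; the sketch shows only the weight-quantization step.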

Maintenance & Community

The project is actively maintained by chenzomi12, with content being continuously updated. Contributions are welcomed via Pull Requests. Video content is hosted on Bilibili and YouTube (ZOMI酱, ZOMI6222).

Licensing & Compatibility

The repository content is open-source, with an expectation of proper attribution when the PPT materials are reused. Specific license terms for the content are not explicitly stated beyond a general encouragement to use it and contribute.

Limitations & Caveats

Many sections are marked as "待更" (to be updated) or "更新中" (updating), indicating the curriculum is still under active development. The sheer size of the repository may pose challenges for users with limited bandwidth or storage.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 56
  • Issues (30d): 3

Star History

1,443 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake).

cookbook by EleutherAI

Deep learning resource for practical model work. 809 stars (top 0.1% on sourcepulse); created 1 year ago, updated 4 days ago.