ai-infra-hpc by jinbooooom

AI Infrastructure and HPC essentials

Created 1 year ago
433 stars

Top 68.3% on SourcePulse

Project Summary

This repository serves as a comprehensive tutorial and knowledge base for AI infrastructure and High-Performance Computing (HPC), detailing low-level interconnects, parallel programming models, and large-scale model training techniques. It targets engineers and researchers needing a deep understanding of hardware-software co-design for demanding AI workloads, offering insights into optimizing performance from chip to cluster.

How It Works

The project systematically covers foundational HPC concepts, including CUDA programming, SIMD, and OpenMP, together with the critical interconnects and transports: PCIe, NVLink, InfiniBand, and RDMA. It then covers collective communication with MPI and NCCL, the major AI training paradigms (data, model, and pipeline parallelism), and supporting frameworks and libraries such as DeepSpeed and DeepEP. The content is structured to build understanding from hardware primitives up to complex distributed training strategies.
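
As a taste of the CUDA material, here is a minimal vector-add kernel of the kind the execution-model chapters build on; it is an illustrative sketch written for this summary, not code taken from the repository:

    // vec_add.cu -- compile with: nvcc vec_add.cu -o vec_add
    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread handles one element; blockIdx/threadIdx map the grid onto the data.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        // Managed memory keeps the sketch short; the tutorials also cover
        // explicit cudaMalloc/cudaMemcpy and the full device memory hierarchy.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        vecAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();                   // wait for the kernel to finish

        printf("c[0] = %.1f\n", c[0]);             // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }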

Quick Start & Requirements

This repository is an educational resource rather than a runnable project: it provides no installation or execution commands, only detailed explanations and code snippets that illustrate the core concepts. Setup depends on individual learning goals and amounts to provisioning the relevant hardware (GPUs, InfiniBand NICs) and software (CUDA Toolkit, OFED).
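
Before diving in, a quick sanity check of the GPU environment can save time. The following sketch (the file name and scope are my own, assuming the CUDA Toolkit and a driver are installed) enumerates the visible GPUs and prints the properties the tutorials refer to:

    // device_query.cu -- compile with: nvcc device_query.cu -o device_query
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            printf("No usable CUDA device: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // Compute capability, global memory, and SM count drive most of the
            // performance discussions in the CUDA chapters.
            printf("GPU %d: %s, compute %d.%d, %zu MiB, %d SMs\n",
                   i, prop.name, prop.major, prop.minor,
                   (size_t)(prop.totalGlobalMem >> 20), prop.multiProcessorCount);
        }
        return 0;
    }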

Highlighted Details

  • In-depth CUDA programming guide covering execution models, memory hierarchy, streams, concurrency, and debugging tools (Nsight, CUDA-GDB).
  • Detailed exploration of GPU interconnects like NVLink/NVSwitch and low-level communication protocols (GPUDirect, RDMA, InfiniBand, RoCE).
  • Comprehensive analysis of NCCL algorithms, protocols, and source code for efficient multi-GPU communication (see the usage sketch after this list).
  • Extensive coverage of distributed training strategies for large models, including DP, DDP, TP, PP, ZeRO, and DeepSpeed/DeepEP.
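
To make the NCCL item concrete, here is a minimal single-process all-reduce across all visible GPUs. It follows the standard NCCL usage pattern (ncclCommInitAll plus grouped ncclAllReduce calls) and is a sketch written for this summary, not code from the repository:

    // nccl_allreduce.cu -- compile with: nvcc nccl_allreduce.cu -lnccl
    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        std::vector<ncclComm_t> comms(ndev);
        std::vector<float*> buf(ndev);
        std::vector<cudaStream_t> streams(ndev);
        const size_t count = 1 << 20;  // elements per rank

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaMalloc(&buf[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }
        ncclCommInitAll(comms.data(), ndev, nullptr);  // one rank per visible device

        // Group the calls so NCCL can launch them together when a single
        // thread drives several devices; this summed all-reduce is the
        // primitive that data-parallel gradient synchronization builds on.
        ncclGroupStart();
        for (int i = 0; i < ndev; ++i)
            ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
            cudaFree(buf[i]);
        }
        printf("all-reduce across %d GPUs done\n", ndev);
        return 0;
    }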

Maintenance & Community

No information on contributors, community channels (Discord/Slack), or a roadmap is available for this repository.

Licensing & Compatibility

No license information is provided.

Limitations & Caveats

This is a learning repository, not a production-ready library. It assumes significant prior knowledge in systems programming and HPC. The content is a collection of notes and tutorials, requiring users to synthesize and apply the information to specific use cases.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

61 stars in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai
0% · 309 stars
Framework for large-scale transformer optimization
Created 4 years ago · Updated 3 years ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai) and Carol Willing (Core Contributor to CPython, Jupyter).

ai-performance-engineering by cfregly
1.2% · 1k stars
AI Systems Performance Engineering for modern AI workloads
Created 1 year ago · Updated 4 weeks ago
Starred by David Cournapeau (Author of scikit-learn), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 5 more.

lectures by gpu-mode
0.5% · 6k stars
Lecture series for GPU-accelerated computing
Created 2 years ago · Updated 5 days ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Co-founder of ClickHouse), and 29 more.

llm.c by karpathy
0.3% · 30k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 2 years ago · Updated 10 months ago