SimAI by aliyun

Simulator for large-scale AI training analysis and optimization

Created 11 months ago
652 stars

Top 51.2% on SourcePulse

Project Summary

SimAI is a full-stack, high-precision simulator designed for analyzing and optimizing large-scale AI training, particularly for Large Language Models (LLMs). It targets researchers and engineers seeking to understand and improve training performance by modeling various layers of the training stack, from framework parameters to network topology.

How It Works

SimAI integrates four core components: AICB for workload modeling, SimCCL for collective communication analysis, astra-sim-alibabacloud for network simulation, and ns-3-alibabacloud for detailed network communication modeling. This modular design allows for flexible simulation scenarios, ranging from fast analytical estimations using bus bandwidth to high-fidelity, full-stack simulations that capture intricate network behaviors. The project leverages extensions from astra-sim and integrates NCCL algorithms for realistic performance evaluation.
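As a rough illustration of what a bus-bandwidth-based analytical estimate can look like, the sketch below converts a bus bandwidth figure into an estimated collective completion time. It assumes the correction factors used by the NCCL tests (for example 2(n-1)/n for all-reduce); the function names are hypothetical, and this is not SimAI's actual implementation or the format of its busbw.yaml file.

```python
# Hypothetical sketch: estimate collective time from bus bandwidth.
# Assumes the NCCL-tests convention busbw = algbw * factor, where
# algbw = message_size / time. Not taken from the SimAI code base.

def busbw_factor(collective: str, n: int) -> float:
    """Return the busbw/algbw ratio for a collective over n ranks."""
    if n < 2:
        raise ValueError("need at least two ranks")
    if collective == "all_reduce":
        return 2 * (n - 1) / n
    if collective in ("all_gather", "reduce_scatter", "all_to_all"):
        return (n - 1) / n
    raise ValueError(f"unsupported collective: {collective}")


def estimate_seconds(collective: str, size_bytes: float, n: int,
                     busbw_gb_per_s: float) -> float:
    """Estimate completion time in seconds for one collective call."""
    # Invert the busbw definition to recover the algorithm bandwidth.
    algbw_bytes_per_s = busbw_gb_per_s * 1e9 / busbw_factor(collective, n)
    return size_bytes / algbw_bytes_per_s


if __name__ == "__main__":
    # 1 GB all-reduce across 8 ranks at an assumed 100 GB/s bus bandwidth.
    t = estimate_seconds("all_reduce", 1e9, 8, 100.0)
    print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```

An estimate like this ignores latency, congestion, and topology effects, which is the kind of detail the full-stack simulation mode is meant to capture.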

Quick Start & Requirements

  • Install: Clone the repository, update its submodules, and compile with ./scripts/build.sh -c analytical (analytical mode) or ./scripts/build.sh -c ns3 (simulation mode).
  • Prerequisites: Tested with GCC/G++ 9.4.0 and Python 3.8.10 on Ubuntu 20.04. NGC container images are recommended for workload generation.
  • Usage: In analytical mode, run ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml. Simulation mode additionally requires generating a network topology and setting specific environment variables.
  • Tutorials: Available for SimAI, aicb, SimCCL, and ns-3-alibabacloud.
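
The analytical-mode invocation from the Usage item can be assembled programmatically when sweeping over cluster sizes. The sketch below is a minimal, hypothetical wrapper: the helper name and parameter names are illustrative, and the reading of -g as the total GPU count and -g_p_s as GPUs per server is an assumption based on the flag names, not something the summary states. It only builds and prints the command; it does not run the simulator.

```python
# Hypothetical helper that builds the SimAI analytical-mode command line.
# Flag names come from the Usage item; the parameter meanings are assumed.
import shlex
from typing import List


def analytical_command(workload: str, gpus: int, gpus_per_server: int,
                       result_prefix: str, busbw: str,
                       binary: str = "./bin/SimAI_analytical") -> List[str]:
    """Return the argv list for one analytical-mode run."""
    return [
        binary,
        "-w", workload,                  # workload description file
        "-g", str(gpus),                 # assumed: total number of GPUs
        "-g_p_s", str(gpus_per_server),  # assumed: GPUs per server
        "-r", result_prefix,             # prefix for result files
        "-busbw", busbw,                 # bus bandwidth configuration
    ]


if __name__ == "__main__":
    cmd = analytical_command("example/workload_analytical.txt", 9216, 8,
                             "test-", "example/busbw.yaml")
    # Print a copy-pasteable shell command.
    print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) from the repository root would launch the run, provided the binary has been built.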

Highlighted Details

  • Full-stack simulation of LLM training processes.
  • Supports analytical (fast estimation) and detailed simulation modes.
  • Beta support for physical traffic generation in CPU RDMA cluster environments.
  • Paper accepted at NSDI'25 Spring.

Maintenance & Community

  • Active community engagement with past and upcoming technical presentations and workshops.
  • Contact emails provided for questions. Community chat groups (DingTalk, WeChat) are available.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The "SimAI-Physical" mode is still in beta and undergoing internal testing. The README does not specify the project's license, which could affect commercial adoption.

Health Check

  • Last commit: 21 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 10
  • Star history: 51 stars in the last 30 days
