SimAI by aliyun

Simulator for large-scale AI training analysis and optimization

Created 1 year ago

814 stars

Top 43.4% on SourcePulse

Project Summary

SimAI is a full-stack, high-precision simulator designed for analyzing and optimizing large-scale AI training, particularly for Large Language Models (LLMs). It targets researchers and engineers seeking to understand and improve training performance by modeling various layers of the training stack, from framework parameters to network topology.

How It Works

SimAI integrates four core components: AICB for workload modeling, SimCCL for collective communication analysis, astra-sim-alibabacloud for network simulation, and ns-3-alibabacloud for detailed network communication modeling. This modular design allows for flexible simulation scenarios, ranging from fast analytical estimations using bus bandwidth to high-fidelity, full-stack simulations that capture intricate network behaviors. The project leverages extensions from astra-sim and integrates NCCL algorithms for realistic performance evaluation.
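As a rough illustration of what a fast analytical estimate from bus bandwidth can look like, the sketch below uses the bus-bandwidth convention from nccl-tests for all-reduce (busbw = algbw * 2(n-1)/n, where algbw is message size divided by time). This is not SimAI's actual model: the function name, the units (bytes, GB/s), and the example numbers are all assumptions made for illustration.

```python
def allreduce_time_s(message_bytes: float, n_ranks: int, busbw_gb_per_s: float) -> float:
    """Estimate all-reduce completion time in seconds from a bus bandwidth figure.

    Hypothetical helper, not part of SimAI. It assumes the nccl-tests
    definition of bus bandwidth for all-reduce:
        busbw = algbw * 2 * (n - 1) / n,  with algbw = message_bytes / time
    """
    if busbw_gb_per_s <= 0:
        raise ValueError("bus bandwidth must be positive")
    if n_ranks < 2:
        # A single rank has nothing to exchange.
        return 0.0
    correction = 2 * (n_ranks - 1) / n_ranks
    busbw_bytes_per_s = busbw_gb_per_s * 1e9  # GB/s -> bytes/s
    # Rearranged from the definition above: time = bytes * correction / busbw.
    return message_bytes * correction / busbw_bytes_per_s


# Example with made-up numbers: 1 GB all-reduce across 8 ranks at 100 GB/s.
estimate = allreduce_time_s(1e9, 8, 100.0)
print(f"estimated all-reduce time: {estimate * 1e3:.1f} ms")
```

An estimate of this kind ignores latency, congestion, and topology effects, which is the gap the high-fidelity simulation modes are meant to cover.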

Quick Start & Requirements

Install: Clone repository, update submodules, and compile using ./scripts/build.sh -c analytical or ./scripts/build.sh -c ns3.
Prerequisites: Tested on GCC/G++ 9.4.0, Python 3.8.10, and Ubuntu 20.04. NGC container images are recommended for workload generation.
Usage: Analytical mode: ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml. Simulation mode requires network topology generation and specific environment variables.
Tutorials: Available for SimAI, aicb, SimCCL, and ns-3-alibabacloud.
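The analytical-mode invocation above can be assembled programmatically when sweeping over cluster sizes. The following is a minimal sketch, assuming that -g is the total GPU count and -g_p_s the number of GPUs per server (as the flag names suggest). The helper name and the divisibility check are hypothetical additions, not something the SimAI README prescribes.

```python
import shlex
from typing import List


def analytical_command(
    workload: str,
    gpus: int,
    gpus_per_server: int,
    result_prefix: str,
    busbw_file: str,
    binary: str = "./bin/SimAI_analytical",
) -> List[str]:
    """Build the argv list for an analytical-mode run (hypothetical helper)."""
    # Assumption for this sketch: the GPU count should fill whole servers.
    if gpus_per_server <= 0 or gpus % gpus_per_server != 0:
        raise ValueError("gpus must be a positive multiple of gpus_per_server")
    return [
        binary,
        "-w", workload,
        "-g", str(gpus),
        "-g_p_s", str(gpus_per_server),
        "-r", result_prefix,
        "-busbw", busbw_file,
    ]


cmd = analytical_command(
    "example/workload_analytical.txt", 9216, 8, "test-", "example/busbw.yaml"
)
# Prints the same command line as the usage example in the summary.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root once the analytical binary has been built.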

Highlighted Details

Full-stack simulation of LLM training processes.
Supports analytical (fast estimation) and detailed simulation modes.
Beta support for physical traffic generation in CPU RDMA cluster environments.
Paper accepted at NSDI '25 Spring.

Maintenance & Community

Active community engagement with past and upcoming technical presentations and workshops.
Contact emails provided for questions. Community chat groups (DingTalk, WeChat) are available.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The "SimAI-Physical" mode is still in beta and undergoing internal testing. The README does not specify the project's license, which could hinder commercial adoption.

Health Check

Last Commit: 1 month ago
Responsiveness: 1 day
Pull Requests (30d): 3
Issues (30d): 6

Star History

14 stars in the last 30 days

