Awesome-ML-SYS-Tutorial by zhaochenyang20

ML SYS learning notes and code

created 8 months ago
3,103 stars

Top 15.8% on sourcepulse

Project Summary

This repository serves as a comprehensive learning resource for Machine Learning Systems (ML-SYS), targeting individuals interested in bridging the gap between ML theory and practical application. It offers detailed learning notes, code examples, and analyses of key systems and techniques in the ML-SYS domain, particularly focusing on Reinforcement Learning from Human Feedback (RLHF) and efficient model serving.

How It Works

The project is structured around the author's personal learning journey and covers three areas: RLHF system development (implementation, reward modeling, and distributed training), model serving optimization (such as latency reduction and embedding model serving), and fundamental ML system concepts (such as NCCL, PyTorch Distributed, and quantization). The content mixes original notes, code walkthroughs, and analyses of existing research and tools such as SGLang, OpenRLHF, and vLLM.

Highlighted Details

  • In-depth exploration of RLHF systems, including industrial-grade implementations and PPO algorithm analysis.
  • Detailed walkthroughs and code analyses of SGLang and vLLM for efficient model serving.
  • Coverage of fundamental ML system concepts like NCCL and PyTorch Distributed, plus quantization and reduced-precision formats (AWQ, BF16).
  • Practical debugging and optimization techniques for latency and weight updates.

Maintenance & Community

  • The repository is a personal learning log, with contributions welcomed via Pull Requests.
  • Links to relevant platforms like HuggingFace Blog and Zhihu are provided for related content.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the underlying code examples and tools referenced (SGLang, OpenRLHF, vLLM) will have their own licenses. Users must verify compatibility for commercial or closed-source use.

Limitations & Caveats

The content is presented as personal learning notes and may not represent production-ready solutions. Some sections are marked as incomplete or in progress, and specific issues such as NCCL hang errors are still being worked on. The author also notes that some of the original write-ups were not to their own liking, so the quality of the content is partly subjective.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 14
  • Issues (30d): 5

Star History

1,145 stars in the last 90 days

Explore Similar Projects

Starred by Omar Sanseviero (DevRel at Google DeepMind) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

llm_training_handbook by huggingface

0%
506
Handbook for large language model training methodologies
created 2 years ago
updated 1 year ago