Awesome-ML-SYS-Tutorial by zhaochenyang20

ML SYS learning notes and code

Created 1 year ago

5,005 stars

Top 9.9% on SourcePulse


5 Experts Love This Project

- Lilian Weng, Cofounder of Thinking Machines Lab
- Yiran Wu, Coauthor of AutoGen
- Yaowei Zheng, Author of LLaMA-Factory
- Shizhe Diao, Author of LMFlow; Research Scientist at NVIDIA

and 1 more

Project Summary

This repository serves as a comprehensive learning resource for Machine Learning Systems (ML-SYS), targeting individuals interested in bridging the gap between ML theory and practical application. It offers detailed learning notes, code examples, and analyses of key systems and techniques in the ML-SYS domain, particularly focusing on Reinforcement Learning from Human Feedback (RLHF) and efficient model serving.

How It Works

The project follows the author's personal learning journey. It covers RLHF system development (implementation, reward modeling, and distributed training), model serving optimization (such as latency reduction and embedding model serving), and fundamental ML system concepts (such as NCCL, PyTorch Distributed, and quantization). The content mixes original notes, code walkthroughs, and analyses of existing research and tools such as SGLang, OpenRLHF, and vLLM.

Quick Start & Requirements

Installation: Clone the repository and follow the individual code examples; there is no single install step.
Prerequisites: Python, PyTorch, and specific libraries mentioned within each section (e.g., SGLang, vLLM, DeepSpeed). Familiarity with ML concepts and system design is beneficial.
Resources: Varies by section; some examples may require significant computational resources (e.g., GPUs for RLHF training, model serving).
Links:
- English README: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial
- Chinese README: https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/README.md
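
Because the prerequisites differ from section to section, it can help to see which of the libraries named above are already present before opening a given example. The following is a minimal sketch, not something shipped with the repository: the helper name `check_prerequisites` is hypothetical, and the import names (`torch`, `sglang`, `vllm`, `deepspeed`) are assumed to be the usual top-level module names for these packages.

```python
import importlib.util

# Assumed top-level import names for the libraries listed under Prerequisites.
# Which of them a given section needs is stated in that section, not here.
PREREQUISITES = {
    "PyTorch": "torch",
    "SGLang": "sglang",
    "vLLM": "vllm",
    "DeepSpeed": "deepspeed",
}


def check_prerequisites(modules=PREREQUISITES):
    """Return {display name: bool} without actually importing the packages.

    find_spec only looks the module up on the import path, so heavy
    libraries are not loaded just to be detected.
    """
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, installed in check_prerequisites().items():
        status = "found" if installed else "missing"
        print(f"{name:<10} {status}")
```

A "missing" entry only means the package is not importable in the current environment; whether it is needed depends on the section being followed.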

Highlighted Details

In-depth exploration of RLHF systems, including industrial-grade implementations and PPO algorithm analysis.
Detailed walkthroughs and code analyses of SGLang and vLLM for efficient model serving.
Coverage of fundamental ML system concepts such as NCCL, PyTorch Distributed, and quantization or reduced-precision formats (AWQ, BF16).
Practical debugging and optimization techniques for latency and weight updates.

Maintenance & Community

The repository is a personal learning log, with contributions welcomed via Pull Requests.
Links to relevant platforms like HuggingFace Blog and Zhihu are provided for related content.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but the code examples and tools it references (SGLang, OpenRLHF, vLLM) carry their own licenses. Users should verify compatibility before commercial or closed-source use.

Limitations & Caveats

The content is presented as personal learning notes and may not represent production-ready solutions. Some sections are marked as incomplete or in progress, and specific issues such as NCCL hang errors are still being addressed. The author also notes dissatisfaction with some of the original writings, so parts of the content reflect personal judgment.

Health Check

Last Commit: 3 days ago
Responsiveness: 1 day

Star History

580 stars in the last 30 days