Pai-Megatron-Patch by alibaba

Training toolkit for LLMs & VLMs using Megatron

created 1 year ago
1,258 stars

Top 32.1% on sourcepulse

Project Summary

This repository provides Pai-Megatron-Patch, a toolkit for efficient training and inference of Large Language Models (LLMs) and Vision-Language Models (VLMs) on the Megatron framework. It targets developers seeking to maximize GPU utilization for large-scale models, offering accelerated training techniques and broad model compatibility.

How It Works

Pai-Megatron-Patch follows a "patch" philosophy, extending Megatron-LM's capabilities without invasive modifications to its source code, which keeps the toolkit compatible with Megatron-LM upgrades. It ships a model library with implementations of popular LLMs, bidirectional weight converters between Hugging Face and Megatron checkpoint formats, and training acceleration through Flash Attention 2.0 and FP8 support via Transformer Engine.
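
To make the conversion idea concrete, here is a toy sketch of the key-remapping step a Hugging Face → Megatron converter performs. The mapping table and parameter names below are illustrative assumptions, not the project's actual converter, which must also handle details such as QKV fusion and tensor-parallel resharding:

```python
import re

# Toy illustration (assumed names, not Pai-Megatron-Patch's real tables):
# converting a Hugging Face checkpoint to Megatron format boils down to
# renaming parameters and resharding tensors. Only the renaming is shown.
HF_TO_MEGATRON = [
    (r"model\.embed_tokens\.weight", r"embedding.word_embeddings.weight"),
    (r"model\.layers\.(\d+)\.mlp\.down_proj\.weight",
     r"decoder.layers.\1.mlp.linear_fc2.weight"),
]

def remap_key(hf_key: str) -> str:
    """Rewrite one Hugging Face parameter name into a Megatron-style name."""
    for pattern, target in HF_TO_MEGATRON:
        if re.fullmatch(pattern, hf_key):
            return re.sub(pattern, target, hf_key)
    return hf_key  # keys without a rule pass through unchanged

print(remap_key("model.layers.3.mlp.down_proj.weight"))
# -> decoder.layers.3.mlp.linear_fc2.weight
```

Running the same table in reverse is what makes the conversion bidirectional: Megatron-trained weights can be exported back to Hugging Face format for downstream serving.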

Quick Start & Requirements

  • Installation and usage details are available via the "Quick Start" link in the README.
  • Requires a robust CUDA-capable GPU environment for large-scale training; specific model recipes may have additional dependencies (a minimal environment probe follows this list).
  • Refer to the official documentation for detailed setup and examples.
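
As a quick sanity check before launching a large run, it can help to probe the CUDA environment. This is a generic sketch (not taken from the README); the capability check reflects the fact that Transformer Engine's FP8 path requires Ada (compute capability 8.9) or Hopper (9.0) class GPUs:

```python
import torch

# Generic environment probe; assumes PyTorch with CUDA support is installed.
assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")

# FP8 via Transformer Engine needs Ada (8.9) or Hopper (9.0) and newer.
if (major, minor) >= (8, 9):
    print("FP8 training path available on this hardware")
else:
    print("No FP8 on this GPU; BF16/FP16 Megatron training still works")
```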

Highlighted Details

  • Supports a wide range of LLMs including Llama, Qwen, Mistral, DeepSeek, and more.
  • Facilitates bidirectional weight conversion between Hugging Face and Megatron checkpoint formats.
  • Accelerates training with Flash Attention 2.0 and supports FP8 via Transformer Engine (a minimal FP8 sketch follows this list).
  • Includes PPO training workflows for reinforcement learning.
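
The FP8 path is exposed through NVIDIA Transformer Engine's autocast context. Below is a minimal standalone sketch of that mechanism using Transformer Engine's public PyTorch API; the layer sizes and recipe settings are arbitrary illustrative choices, not values taken from this repository:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# An FP8-capable linear layer from Transformer Engine (sizes are arbitrary;
# FP8 GEMMs want dimensions that are multiples of 16).
layer = te.Linear(1024, 1024, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)

# Delayed-scaling recipe: HYBRID = E4M3 for activations/weights in the
# forward pass, E5M2 for gradients in the backward pass.
recipe = DelayedScaling(fp8_format=Format.HYBRID,
                        amax_history_len=16,
                        amax_compute_algo="max")

# All supported GEMMs inside this context run in FP8.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
y.sum().backward()  # backward pass also follows the FP8 recipe
```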

Maintenance & Community

  • Developed by Alibaba Cloud's Machine Learning Platform (PAI) algorithm team.
  • Community contact is available via a DingTalk group QR code in the README.

Licensing & Compatibility

  • Licensed under the Apache License (Version 2.0).
  • May contain code from other repositories under different open-source licenses; consult the NOTICE file.

Limitations & Caveats

  • Some features, such as distributed checkpoint conversion, are marked experimental.
  • Model support is documented through individual per-model links in the README, so integration depth and maturity may vary across models.
Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 3
  • Issues (30d): 8
Star History
232 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI

Library for aligning LLMs using human-aware loss functions
Top 0.2% · 873 stars · created 1 year ago · updated 2 weeks ago
Starred by Lewis Tunstall (Researcher at Hugging Face), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 5 more.

torchtune by pytorch

PyTorch library for LLM post-training and experimentation
Top 0.2% · 5k stars · created 1 year ago · updated 1 day ago