surogate  by invergent-ai

High-performance AI model training and fine-tuning

Created 5 months ago
795 stars

Top 43.8% on SourcePulse

Project Summary

Surogate Trainer is a high-performance framework designed for rapid experimentation in training and fine-tuning large language models, targeting developers and enterprises. It offers significant speedups and VRAM efficiency through advanced quantization techniques like FP8 and FP4, alongside a native C++/CUDA engine, aiming to surpass existing training frameworks.

How It Works

Surogate uses a native C++/CUDA engine that targets "Speed-Of-Light" (SOL) throughput and enables FP8 and FP4 (NVFP4) training and fine-tuning. Its core design includes CPU offloading for weights, gradients, and activations, which the project presents as giving lower VRAM usage and higher speed than methods like QLoRA. The framework supports mixed-precision training and offers a Python DSL with Ahead-of-Time (AOT) auto-differentiation for integrating new model architectures.
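
To make the FP4 idea concrete, here is a minimal pure-Python sketch of block-scaled quantization to the E2M1 value grid. It is an illustration only, not Surogate's engine or API: the function names, the block size of 16, the float scale per block, and the simple nearest-value rounding are all assumptions made for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign bit handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def quantize_block(block):
    """Return (scale, codes) such that block is approximately scale * codes.

    The scale maps the largest magnitude in the block onto E2M1_MAX.
    Hypothetical helper: real kernels store the scale in a compact format
    and use hardware rounding; this sketch keeps a plain Python float.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAX
    codes = []
    for x in block:
        target = abs(x) / scale
        # Nearest representable magnitude; ties go to the smaller value.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def quantize(values, block_size=16):
    """Split values into blocks and quantize each with its own scale."""
    return [
        quantize_block(values[start:start + block_size])
        for start in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    out = []
    for scale, codes in blocks:
        out.extend(scale * c for c in codes)
    return out
```

Because each block carries its own scale, one large outlier only degrades the precision of the values that share its block, which is the usual motivation for block scaling over a single per-tensor scale.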

Quick Start & Requirements

  • Installation: Options include Docker (recommended, with specific CUDA versions), an install.sh script (auto-detects CUDA), or building from source.
  • Prerequisites: NVIDIA GPU with a recent driver, CUDA 12.8/12.9/13.x, NCCL, and cuDNN are required. The system must be Linux x86_64.
  • Hardware: Supports NVIDIA GPUs from SM80 (e.g., A100) up to SM121 (Blackwell). Note that FP4 training specifically requires Blackwell+ GPUs (SM100+).
  • Links: Documentation is available at https://docs.surogate.ai, and examples can be found at https://github.com/invergent-ai/surogate/tree/master/examples.
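
The hardware rules above can be expressed as a small pre-flight check. This is a sketch under stated assumptions, not part of Surogate: `check_gpu` is a hypothetical helper, and it encodes only the two constraints listed here (SM80 through SM121 supported, FP4 requiring SM100 or newer).

```python
def sm_version(major, minor):
    """Convert a CUDA compute capability such as (12, 1) to an SM number such as 121."""
    return major * 10 + minor


def check_gpu(major, minor):
    """Hypothetical pre-flight check based on the ranges stated in this summary."""
    sm = sm_version(major, minor)
    supported = 80 <= sm <= 121          # SM80 (e.g., A100) up to SM121
    fp4 = supported and sm >= 100        # FP4 training needs SM100+
    return {"sm": sm, "supported": supported, "fp4": fp4}


if __name__ == "__main__":
    # Example capabilities: A100 is (8, 0); (10, 0) corresponds to SM100.
    for cap in [(7, 5), (8, 0), (10, 0), (12, 1)]:
        print(cap, check_gpu(*cap))
```

If PyTorch happens to be installed, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`; whether Surogate itself exposes such a check is not stated in this summary.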

Highlighted Details

  • Supports native FP8 (E4M3/E5M2) and FP4 (E2M1) training with block scaling for extreme performance and memory efficiency.
  • Features native CPU offloading for weights, gradients, activations, and quants, outperforming QLoRA in VRAM usage and speed.
  • Includes advanced adaptive training capabilities such as automated monitoring, multi-criteria early stopping, auto LR management, and MoE imbalance detection.
  • Supports a wide range of models including Qwen3/3.5, Nemotron, GPT-OSS, and Llama 3.1/3.2.
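
As an illustration of what "multi-criteria early stopping" can mean in practice, the following sketch combines three stop conditions on the evaluation loss: a non-finite value, divergence from the best loss so far, and a plateau. The class name, thresholds, and criteria are assumptions chosen for this example; the summary does not describe which criteria Surogate actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Hypothetical stopper; assumes a positive loss where lower is better."""
    patience: int = 3               # evaluations without improvement before stopping
    min_delta: float = 1e-3         # minimum decrease that counts as improvement
    divergence_factor: float = 2.0  # stop if loss exceeds this multiple of the best
    best: float = math.inf
    stale: int = 0

    def update(self, eval_loss):
        """Return a stop reason string, or None to continue training."""
        if math.isnan(eval_loss) or math.isinf(eval_loss):
            return "non-finite loss"
        if self.best < math.inf and eval_loss > self.divergence_factor * self.best:
            return "divergence"
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.stale = 0
            return None
        self.stale += 1
        if self.stale >= self.patience:
            return "plateau"
        return None
```

A training loop would call `update` after each evaluation and break when it returns a reason, logging that reason so the stop is easy to audit later.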

Maintenance & Community

The project appears actively maintained, with support for recent hardware and models. Community interaction is primarily through their Twitter handle @surogate_ai. Contributions are welcomed via GitHub pull requests.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Native FP4 training (NVFP4) is explicitly stated to require Blackwell+ GPUs (SM100+). Docker images are currently limited to the x86_64 architecture. Users requiring support for specific model architectures not yet listed are encouraged to contribute via pull requests.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 631 stars in the last 30 days
