surogate by invergent-ai

High-performance AI model training and fine-tuning

Created 6 months ago

806 stars

Top 43.0% on SourcePulse

Project Summary

Surogate Trainer is a high-performance framework for rapid experimentation in training and fine-tuning large language models, aimed at developers and enterprises. It reports significant speedups and lower VRAM use through low-precision FP8 and FP4 training and a native C++/CUDA engine, with the stated goal of outperforming existing training frameworks.
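To give a sense of why lower-precision formats reduce VRAM use, here is a rough back-of-the-envelope sketch in Python. It counts weight storage only and ignores gradients, optimizer state, and activations. The 7B parameter count is a hypothetical example, and the FP4 layout (one 8-bit scale per 16-value block) is an assumption made for illustration, not a figure taken from the Surogate documentation.

```python
def weight_bytes(num_params, bits_per_value, block_size=0, scale_bits=0):
    """Approximate bytes needed to store `num_params` weights.

    If `block_size` is non-zero, each block of that many values also
    stores one scale factor of `scale_bits` bits (block scaling).
    """
    bits = num_params * bits_per_value
    if block_size:
        bits += (num_params / block_size) * scale_bits
    return bits / 8


PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

bf16 = weight_bytes(PARAMS, 16)
fp8 = weight_bytes(PARAMS, 8)
# Assumed layout: 4-bit values plus one 8-bit scale per 16-value block.
fp4 = weight_bytes(PARAMS, 4, block_size=16, scale_bits=8)

for name, size in (("BF16", bf16), ("FP8", fp8), ("FP4", fp4)):
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under these assumptions the weights shrink from 14 GB in BF16 to 7 GB in FP8 and just under 4 GB in FP4. Actual training footprints depend on much more than weight storage.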

How It Works

Surogate uses a native C++/CUDA engine for "Speed-Of-Light" (SOL) throughput and supports FP8 and FP4 (NVFP4) training and fine-tuning. Its core design includes CPU offloading for weights, gradients, and activations, which the project describes as using less VRAM and running faster than methods like QLoRA. The framework supports mixed-precision training and offers a Python DSL with Ahead-of-Time (AOT) auto-differentiation for integrating new model architectures.
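As an illustration of what block-scaled FP4 means in practice, the following pure-Python sketch quantizes a list of values to the E2M1 grid with one scale per block. It is a simplified model, not Surogate's engine: the function names are hypothetical, the default block size of 16 is an assumption, the scale is kept as an ordinary Python float rather than a low-precision format, and ties are broken toward the smaller magnitude instead of following hardware rounding rules.

```python
# Magnitudes representable by a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 grid values."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid point
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its E2M1 codes."""
    return [scale * c for c in codes]


def quantize(values, block_size=16):
    """Split `values` into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


scale, codes = quantize_block([12.0, 5.2, -9.0, 1.0])
print(scale, codes, dequantize_block(scale, codes))
```

Because each block has its own scale, an outlier only degrades the precision of the values that share its block, which is the main motivation for block scaling over a single per-tensor scale.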

Quick Start & Requirements

Installation: Options include Docker (recommended, with specific CUDA versions), an install.sh script (auto-detects CUDA), or building from source.
Prerequisites: NVIDIA GPU with a recent driver, CUDA 12.8/12.9/13.x, NCCL, and cuDNN are required. The system must be Linux x86_64.
Hardware: Supports NVIDIA GPUs from SM80 (e.g., A100) up to SM121 (Blackwell). Note that FP4 training specifically requires Blackwell+ GPUs (SM100+).
Links: Documentation is available at https://docs.surogate.ai, and examples can be found at https://github.com/invergent-ai/surogate/tree/master/examples.
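The hardware requirements above can be expressed as a small check on a GPU's compute capability. This is a minimal sketch, assuming the SM number is formed as major * 10 + minor; the function name `surogate_hw_support` and the returned keys are hypothetical and not part of Surogate.

```python
def surogate_hw_support(major, minor):
    """Map a CUDA compute capability (major, minor) to the ranges listed above."""
    sm = major * 10 + minor          # e.g. (8, 0) -> SM80, (12, 1) -> SM121
    in_range = 80 <= sm <= 121       # supported span stated in this summary
    return {
        "sm": sm,
        "supported": in_range,
        "fp4_training": in_range and sm >= 100,  # FP4 needs SM100 or newer
    }


print(surogate_hw_support(8, 0))    # A100-class GPU
print(surogate_hw_support(10, 0))   # SM100 GPU
print(surogate_hw_support(7, 5))    # older than SM80
```

If PyTorch happens to be installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device and can be passed in directly.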

Highlighted Details

Supports native FP8 (E4M3/E5M2) and FP4 (E2M1) training with block scaling for extreme performance and memory efficiency.
Features native CPU offloading for weights, gradients, activations, and quants, outperforming QLoRA in VRAM usage and speed.
Includes advanced adaptive training capabilities such as automated monitoring, multi-criteria early stopping, auto LR management, and MoE imbalance detection.
Supports a wide range of models including Qwen3/3.5, Nemotron, GPT-OSS, and Llama 3.1/3.2.
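To make "multi-criteria early stopping" concrete, here is a generic sketch that combines two criteria: a divergence guard and a plateau detector. It is not Surogate's implementation; the class name, parameters, and thresholds are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarlyStopper:
    patience: int = 3          # evaluations without improvement before stopping
    min_delta: float = 1e-3    # smallest loss decrease that counts as improvement
    max_loss: float = 1e4      # losses above this are treated as divergence
    best: float = float("inf")
    bad_evals: int = 0

    def update(self, eval_loss: float) -> Optional[str]:
        """Record one evaluation loss; return a stop reason or None to continue."""
        # Criterion 1: the run has diverged (NaN or exploding loss).
        if math.isnan(eval_loss) or eval_loss > self.max_loss:
            return "diverged"
        # Criterion 2: the loss has stopped improving for `patience` evaluations.
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        if self.bad_evals >= self.patience:
            return "plateau"
        return None


stopper = EarlyStopper(patience=2)
for loss in (2.0, 1.5, 1.4999, 1.6):
    reason = stopper.update(loss)
    print(loss, reason)
```

A training loop would call `update` after each evaluation and stop when it returns a reason. Returning the reason rather than a bare boolean makes it easy to log why a run ended.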

Maintenance & Community

The project appears actively maintained, with support for recent hardware and models. Community interaction is primarily through the project's Twitter account, @surogate_ai. Contributions are welcomed via GitHub pull requests.

Licensing & Compatibility

The project is licensed under the Apache 2.0 license, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Native FP4 (NVFP4) training requires Blackwell or newer GPUs (SM100+). Docker images are currently limited to the x86_64 architecture. Users who need model architectures that are not yet listed are encouraged to contribute them via pull requests.
