MegaTrain by DLYuanGod

Train massive LLMs and VLMs on a single GPU

Created 5 days ago

New!

307 stars

Top 87.4% on SourcePulse

View on GitHub
Project Summary

Summary

MegaTrain enables full-precision training of 100B+ parameter LLMs on a single GPU, addressing prohibitive hardware costs. Detailed in arXiv 2604.05091, it targets researchers and engineers who need to scale LLM training without massive distributed infrastructure, democratizing access to large-model development.

How It Works

A RAM-centric architecture stores parameters in host (CPU) RAM, treating GPUs as transient compute engines to overcome VRAM limitations. It employs double-buffered execution to overlap CPU-GPU weight transfers with computation, along with gradient checkpointing and manual gradient computation. MegaTrain supports hybrid attention (linear + full) and MoE layers, automatically adapting to diverse model architectures.
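The double-buffered execution pattern can be sketched in plain Python. This is an illustrative stand-in, not MegaTrain's actual API: a background worker "copies" the next layer's weights from host RAM while the current layer computes, so transfer and compute overlap.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: host_weights plays the role of parameters in
# CPU RAM; copy_to_device simulates an async host-to-GPU transfer; and
# compute simulates one layer's forward pass on the device-side buffer.
host_weights = [[float(i)] * 4 for i in range(6)]  # 6 "layers" in CPU RAM

def copy_to_device(layer_idx):
    # Real system: an asynchronous H2D copy (e.g. a non-blocking .to("cuda"))
    return list(host_weights[layer_idx])

def compute(x, device_weights):
    # Toy "layer": add the sum of the layer's weights to the activation
    return x + sum(device_weights)

def forward_double_buffered(x, num_layers):
    with ThreadPoolExecutor(max_workers=1) as copier:
        next_buf = copier.submit(copy_to_device, 0)      # prefetch layer 0
        for i in range(num_layers):
            cur = next_buf.result()                      # wait for transfer
            if i + 1 < num_layers:
                next_buf = copier.submit(copy_to_device, i + 1)  # prefetch
            x = compute(x, cur)        # compute overlaps with the next copy
    return x

result = forward_double_buffered(0.0, 6)
```

Only two layer-sized device buffers are ever live at once, which is how the scheme keeps VRAM usage independent of total model size.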

Quick Start & Requirements

Install via git clone https://github.com/DLYuanGod/MegaTrain.git && cd MegaTrain && pip install -e . (an editable install). Requires Python 3.9+ and PyTorch 2.0+. Optional performance dependencies include flash-attn, flash-linear-attention, causal-conv1d, and deepspeed. Crucially, use scripts/calc_resource.py to determine the optimal batch_size for your specific hardware.

Highlighted Details

  • Enables training of 120B+ parameter models on a single GPU.
  • Supports any HuggingFace decoder-only LLM or VLM via AutoModel.
  • Handles hybrid attention (linear + full) and MoE layers automatically.
  • Claims 1.84x speedup over DeepSpeed ZeRO-3 on 14B models.
  • Features LlamaFactory-style data registry (Alpaca, ShareGPT, JSON, HF Hub).
  • Configuration via YAML files, with 25+ pre-made examples.
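The gradient checkpointing and manual gradient computation described under How It Works can be sketched in pure Python. This is an illustrative toy, not MegaTrain's implementation: the forward pass stores only each layer's input (the checkpoint), and the backward pass recomputes the layer before applying a hand-written derivative instead of an autograd tape.

```python
# Toy "layer": scale the activation by a scalar weight w.
def layer_forward(x, w):
    return x * w

def layer_grad(x, w, grad_out):
    # Manual derivatives of y = x * w: (dL/dx, dL/dw)
    return grad_out * w, grad_out * x

def train_step(x, weights):
    checkpoints = []
    for w in weights:                 # forward: keep inputs, drop activations
        checkpoints.append(x)
        x = layer_forward(x, w)
    loss_grad = 1.0                   # dL/dy for the trivial loss L = y
    weight_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        x_in = checkpoints[i]
        _ = layer_forward(x_in, weights[i])   # recompute from checkpoint
        loss_grad, weight_grads[i] = layer_grad(x_in, weights[i], loss_grad)
    return x, weight_grads

out, grads = train_step(2.0, [3.0, 4.0])
# L = x * w0 * w1, so dL/dw0 = x * w1 and dL/dw1 = x * w0
```

Trading the recomputation in the backward pass for not storing activations is what lets memory stay proportional to one layer rather than the whole network.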

Maintenance & Community

The README does not detail specific maintenance practices, notable contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility

Licensed under the Apache-2.0 License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats

Designed for decoder-only models; encoder-decoder architectures are unsupported. Accurate batch_size configuration via the resource calculator is critical to prevent OOM errors or inefficient GPU utilization. The project appears research-oriented, and production stability is not explicitly documented.

Health Check
Last Commit

9 hours ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
2
Star History
315 stars in the last 5 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

0.3%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 6 hours ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng (Coauthor of SGLang).

fastllm by ztxz16

0.1%
4k
High-performance C++ LLM inference library
Created 2 years ago
Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 1 day ago
Starred by Tobi Lutke (Cofounder of Shopify), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 41 more.

unsloth by unslothai

2.6%
61k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 2 years ago
Updated 20 hours ago