MegaTrain  by DLYuanGod

Train massive LLMs and VLMs on a single GPU

Created 1 month ago
597 stars

Top 54.1% on SourcePulse

GitHubView on GitHub
Project Summary

Summary MegaTrain enables full-precision training of 100B+ parameter LLMs on a single GPU, addressing prohibitive hardware costs. Detailed in arXiv 2604.05091, it targets researchers and engineers needing to scale LLM training without massive distributed infrastructure, democratizing access to large model development.

How It Works A RAM-centric architecture stores parameters in host (CPU) RAM, treating GPUs as transient compute engines to overcome VRAM limitations. It employs double-buffered execution for overlapped CPU-GPU weight transfer, gradient checkpointing, and manual gradient computation. MegaTrain supports hybrid attention (linear + full) and MoE layers, automatically adapting to diverse model architectures.

Quick Start & Requirements Install via git clone https://github.com/DLYuanGod/MegaTrain.git && cd MegaTrain && pip install -e .. Requires Python 3.9+ and PyTorch 2.0+. Optional performance dependencies include flash-attn, flash-linear-attention, causal-conv1d, and deepspeed. Crucially, use scripts/calc_resource.py to determine optimal batch_size for specific hardware.

Highlighted Details

  • Enables training of 120B+ parameter models on a single GPU.
  • Supports any HuggingFace decoder-only LLM or VLM via AutoModel.
  • Handles hybrid attention (linear + full) and MoE layers automatically.
  • Claims 1.84x speedup over DeepSpeed ZeRO-3 on 14B models.
  • Features LlamaFactory-style data registry (Alpaca, ShareGPT, JSON, HF Hub).
  • Configuration via YAML files, with 25+ pre-made examples.

Maintenance & Community The README does not detail specific maintenance practices, notable contributors, sponsorships, or community channels like Discord or Slack.

Licensing & Compatibility Licensed under the Apache-2.0 License, permitting commercial use and integration into closed-source projects.

Limitations & Caveats Designed for decoder-only models; encoder-decoder architectures are unsupported. Accurate batch_size configuration via the resource calculator is critical to prevent OOM errors or inefficient utilization. The project appears research-oriented, with production stability not explicitly detailed.

Health Check
Last Commit

6 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
75 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems") and Yineng Zhang Yineng Zhang(Inference Lead at SGLang; Research Scientist at Together AI).

rtp-llm by alibaba

1.0%
1k
LLM inference engine for diverse applications
Created 2 years ago
Updated 9 hours ago
Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems") and Ying Sheng Ying Sheng(Coauthor of SGLang).

fastllm by ztxz16

0.8%
5k
High-performance C++ LLM inference library
Created 3 years ago
Updated 10 hours ago
Starred by Tobi Lutke Tobi Lutke(Cofounder of Shopify), Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and
41 more.

unsloth by unslothai

0.5%
65k
Finetuning tool for LLMs, targeting speed and memory efficiency
Created 2 years ago
Updated 9 hours ago
Feedback? Help us improve.