Moonlight: MoE Model and Muon Optimizer Tech Report
This repository provides the Moonlight LLM, a 16B parameter Mixture-of-Experts model trained using the Muon optimizer. It aims to advance the Pareto frontier of performance versus training FLOPs, offering improved computational efficiency and strong benchmark results across various tasks, including English, code, math, and Chinese evaluations. The project is suitable for researchers and developers looking for high-performance LLMs with efficient training methodologies.
How It Works
Moonlight leverages the Muon optimizer, enhanced with weight decay and per-parameter update scale adjustments for improved stability and scalability. The implementation is distributed with ZeRO-1 style optimizations for memory efficiency and reduced communication overhead. This approach allows Muon to achieve comparable performance to AdamW with approximately 52% of the training FLOPs, demonstrating significant computational gains.
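To make the update rule concrete, here is a minimal single-matrix sketch of a Muon-style step. This is not the repository's actual implementation: the function names (newton_schulz_orthogonalize, muon_step) and the default hyperparameters are illustrative, and the coefficients follow the publicly described Muon recipe of momentum followed by Newton-Schulz orthogonalization, with the decoupled weight decay and per-parameter scale adjustment mentioned above.

import math
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize the update matrix with a quintic
    # Newton-Schulz iteration (coefficients from the public Muon recipe).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=2e-2, beta=0.95, weight_decay=0.1):
    # Accumulate momentum, then orthogonalize the resulting update.
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    # Per-parameter scale so the update RMS stays consistent across matrix
    # shapes; 0.2 * sqrt(max dim) is the adjustment described in the report.
    scale = 0.2 * math.sqrt(max(param.shape[0], param.shape[1]))
    param.mul_(1 - lr * weight_decay)      # decoupled (AdamW-style) weight decay
    param.add_(update, alpha=-lr * scale)  # scaled orthogonal update

In a full optimizer this step applies only to 2D weight matrices; embeddings, output heads, and other non-matrix parameters are typically handled by AdamW.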
Quick Start & Requirements
Requirements: python=3.10, torch>=2.1.0, transformers=4.48.2
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model; trust_remote_code is required for the custom MoE architecture.
model_path = "moonshotai/Moonlight-16B-A3B"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
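The snippet above only loads the model and tokenizer; a minimal generation call might look like the following (the prompt and max_new_tokens value are illustrative, not taken from the README):

prompt = "1+1=2, 1+2="
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))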
Highlighted Details
- 16B total parameters in a Mixture-of-Experts architecture; the "A3B" suffix indicates roughly 3B activated parameters per token.
- Muon reaches AdamW-comparable performance at approximately 52% of the training FLOPs.
- Distributed ZeRO-1 style implementation for memory efficiency and reduced communication overhead.
- Evaluated on English, code, math, and Chinese benchmarks.
Maintenance & Community
The project is associated with MoonshotAI. Intermediate checkpoints are planned for release. Citation details for the accompanying paper are provided.
Licensing & Compatibility
The repository does not explicitly state a license in the README. The model weights are distributed via Hugging Face, typically under a license defined on the model card; commercial use requires checking that specific license.
Limitations & Caveats
The README mentions that intermediate checkpoints are "coming soon," indicating they are not yet available. The specific license for the model weights and code is not clearly stated in the README, which could affect commercial adoption.