Moonlight by MoonshotAI

MoE model and optimizer tech report

Created 6 months ago
1,311 stars

Top 30.5% on SourcePulse

View on GitHub
Project Summary

This repository provides Moonlight, a Mixture-of-Experts language model with 16B total parameters (about 3B activated per token) trained with the Muon optimizer. The project aims to push the Pareto frontier of performance versus training FLOPs, pairing improved computational efficiency with strong benchmark results across English, code, math, and Chinese evaluations. It is suited to researchers and developers looking for high-performing LLMs trained with efficient methods.

How It Works

Moonlight is trained with the Muon optimizer, extended with AdamW-style weight decay and a per-parameter update-scale adjustment for improved stability and scalability. The distributed implementation partitions optimizer state in a ZeRO-1 style for memory efficiency and reduced communication overhead. With these changes, Muon matches AdamW's performance while using approximately 52% of the training FLOPs, a significant computational gain.
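
The repository's distributed implementation is not reproduced here; the following is a minimal single-matrix sketch of the ideas just described: momentum orthogonalized with a Newton-Schulz iteration, decoupled (AdamW-style) weight decay, and an update scale tied to the matrix shape so the update RMS roughly matches AdamW. The Newton-Schulz coefficients, the 0.2 constant, and the function names are assumptions based on the publicly described Muon recipe and the tech report, not the repository's actual code.

    import torch

    def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        """Approximately orthogonalize a 2-D update via a quintic Newton-Schulz iteration.
        Coefficients follow the publicly described Muon recipe (an assumption here)."""
        a, b, c = 3.4445, -4.7750, 2.0315
        x = g / (g.norm() + 1e-7)           # bound the norm so the iteration converges
        transposed = x.shape[0] > x.shape[1]
        if transposed:                      # work with the smaller Gram matrix
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * (s @ s)) @ x
        return x.T if transposed else x

    @torch.no_grad()
    def muon_step(param, grad, momentum_buf, lr=2e-2, beta=0.95, weight_decay=0.1):
        """One Muon-style step for a single 2-D weight matrix (a sketch, not the
        repository's distributed ZeRO-1 implementation)."""
        momentum_buf.mul_(beta).add_(grad)                   # heavy-ball momentum
        update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalized direction
        # Per-parameter scale so the update RMS roughly matches AdamW; the 0.2 constant
        # and the sqrt(max(fan_in, fan_out)) rule are assumptions based on the tech report.
        scale = 0.2 * max(param.shape) ** 0.5
        param.mul_(1.0 - lr * weight_decay)                  # decoupled (AdamW-style) weight decay
        param.add_(update, alpha=-lr * scale)

In the actual repository the optimizer state is additionally partitioned across data-parallel ranks in a ZeRO-1 style; the sketch above only captures the per-matrix math.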

Quick Start & Requirements

  • Inference: Use Hugging Face Transformers. Recommended environment: python=3.10, torch>=2.1.0, transformers=4.48.2.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_path = "moonshotai/Moonlight-16B-A3B"
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # Minimal completion example (base model, plain text completion):
    inputs = tokenizer("1+1=2, 1+2=", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    
  • Training: The repository provides example commands for training dense models with either Muon or AdamW; a rough optimizer-setup sketch follows this list.
  • Dependencies: PyTorch, Transformers.
  • Resources: Requires significant GPU resources for training and inference.
  • Links: Hugging Face Model, Tech Report
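
The repository's exact training commands are not reproduced here. As a rough illustration only, the sketch below shows the kind of parameter split Muon-style recipes typically use: 2-D weight matrices go to Muon, while embeddings, norms, biases, and the output head stay on AdamW. The name-matching rules and group boundaries are assumptions, not the repository's actual configuration.

    import torch
    from torch import nn

    def split_param_groups(model: nn.Module):
        """Split parameters into a Muon group and an AdamW group (illustrative rules)."""
        muon_params, adamw_params = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            # Assumed rule: 2-D hidden weight matrices use Muon; everything else uses AdamW.
            if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
                muon_params.append(p)
            else:
                adamw_params.append(p)
        return muon_params, adamw_params

    # Usage sketch: the AdamW group uses standard torch.optim; the Muon group would be
    # handled by a Muon implementation such as the muon_step sketch above (hypothetical).
    # muon_params, adamw_params = split_param_groups(model)
    # adamw = torch.optim.AdamW(adamw_params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)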

Highlighted Details

  • Achieves ∼2x computational efficiency compared to AdamW with compute-optimal training.
  • Moonlight (16B-parameter MoE, ~3B activated) outperforms Llama3.2-3B and Qwen2.5-3B on MMLU (70.0 vs. 54.75 and 65.6, respectively) and other benchmarks.
  • Trained on 5.7T tokens, achieving state-of-the-art performance for its scale.
  • Compatible with inference engines such as vLLM and SGLang (see the sketch below).
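
As a quick illustration of the vLLM path, the sketch below uses vLLM's offline LLM API with the base checkpoint named above. It assumes the engine accepts the model's custom code via trust_remote_code and that vLLM supports the architecture as the README claims; engine, memory, and parallelism settings are left at defaults.

    from vllm import LLM, SamplingParams

    # Offline-inference sketch with vLLM (not the repository's own serving setup).
    llm = LLM(model="moonshotai/Moonlight-16B-A3B", trust_remote_code=True)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(["1+1=2, 1+2="], params)
    print(outputs[0].outputs[0].text)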

Maintenance & Community

The project is associated with MoonshotAI. Intermediate checkpoints are planned for release. Citation details for the accompanying paper are provided.

Licensing & Compatibility

The README does not explicitly state a license. The model weights are distributed via Hugging Face and are governed by whatever license is defined on the model card; check that license before any commercial use.

Limitations & Caveats

The README mentions that intermediate checkpoints are "coming soon," indicating they are not yet available. The specific license for the model weights and code is not clearly stated in the provided text, which could impact commercial adoption.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 39 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4%
1k
Framework for scaling multimodal model training across accelerators
Created 5 months ago
Updated 3 weeks ago