Moonlight by MoonshotAI

MoE model and optimizer tech report

Created 6 months ago
1,311 stars

Top 30.5% on SourcePulse

View on GitHub
Project Summary

This repository provides Moonlight, a Mixture-of-Experts language model with 16B total parameters (about 3B activated per token) trained with the Muon optimizer. The project aims to push the Pareto frontier of performance versus training FLOPs, pairing improved computational efficiency with strong benchmark results across English, code, math, and Chinese evaluations. It is suited to researchers and developers looking for high-performing LLMs trained with efficient methods.

How It Works

Moonlight is trained with the Muon optimizer, extended with AdamW-style weight decay and a per-parameter update-scale adjustment for improved stability and scalability. The distributed implementation partitions optimizer state in a ZeRO-1 style for memory efficiency and reduced communication overhead. With these changes, Muon matches AdamW's performance while using approximately 52% of the training FLOPs, a significant computational gain.
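
The repository's distributed implementation is not reproduced here; the following is a minimal single-matrix sketch of the ideas just described: momentum orthogonalized with a Newton-Schulz iteration, decoupled (AdamW-style) weight decay, and an update scale tied to the matrix shape so the update RMS roughly matches AdamW. The Newton-Schulz coefficients, the 0.2 constant, and the function names are assumptions based on the publicly described Muon recipe and the tech report, not the repository's actual code.

    import torch

    def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
        """Approximately orthogonalize a 2-D update via a quintic Newton-Schulz iteration.
        Coefficients follow the publicly described Muon recipe (an assumption here)."""
        a, b, c = 3.4445, -4.7750, 2.0315
        x = g / (g.norm() + 1e-7)           # bound the norm so the iteration converges
        transposed = x.shape[0] > x.shape[1]
        if transposed:                      # work with the smaller Gram matrix
            x = x.T
        for _ in range(steps):
            s = x @ x.T
            x = a * x + (b * s + c * (s @ s)) @ x
        return x.T if transposed else x

    @torch.no_grad()
    def muon_step(param, grad, momentum_buf, lr=2e-2, beta=0.95, weight_decay=0.1):
        """One Muon-style step for a single 2-D weight matrix (a sketch, not the
        repository's distributed ZeRO-1 implementation)."""
        momentum_buf.mul_(beta).add_(grad)                   # heavy-ball momentum
        update = newton_schulz_orthogonalize(momentum_buf)   # orthogonalized direction
        # Per-parameter scale so the update RMS roughly matches AdamW; the 0.2 constant
        # and the sqrt(max(fan_in, fan_out)) rule are assumptions based on the tech report.
        scale = 0.2 * max(param.shape) ** 0.5
        param.mul_(1.0 - lr * weight_decay)                  # decoupled (AdamW-style) weight decay
        param.add_(update, alpha=-lr * scale)

In the actual repository the optimizer state is additionally partitioned across data-parallel ranks in a ZeRO-1 style; the sketch above only captures the per-matrix math.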

Quick Start & Requirements

  • Inference: Use Hugging Face Transformers. Recommended environment: python=3.10, torch>=2.1.0, transformers=4.48.2.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_path = "moonshotai/Moonlight-16B-A3B"
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # Minimal completion example (base model, plain text completion):
    inputs = tokenizer("1+1=2, 1+2=", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    
  • Training: The repository provides example commands for training dense models with either Muon or AdamW; a rough optimizer-setup sketch follows this list.
  • Dependencies: PyTorch, Transformers.
  • Resources: Requires significant GPU resources for training and inference.
  • Links: Hugging Face Model, Tech Report
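
The repository's exact training commands are not reproduced here. As a rough illustration only, the sketch below shows the kind of parameter split Muon-style recipes typically use: 2-D weight matrices go to Muon, while embeddings, norms, biases, and the output head stay on AdamW. The name-matching rules and group boundaries are assumptions, not the repository's actual configuration.

    import torch
    from torch import nn

    def split_param_groups(model: nn.Module):
        """Split parameters into a Muon group and an AdamW group (illustrative rules)."""
        muon_params, adamw_params = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            # Assumed rule: 2-D hidden weight matrices use Muon; everything else uses AdamW.
            if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
                muon_params.append(p)
            else:
                adamw_params.append(p)
        return muon_params, adamw_params

    # Usage sketch: the AdamW group uses standard torch.optim; the Muon group would be
    # handled by a Muon implementation such as the muon_step sketch above (hypothetical).
    # muon_params, adamw_params = split_param_groups(model)
    # adamw = torch.optim.AdamW(adamw_params, lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)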

Highlighted Details

  • Achieves ∼2x computational efficiency compared to AdamW with compute-optimal training.
  • Moonlight (16B-parameter MoE, ~3B activated) outperforms Llama3.2-3B and Qwen2.5-3B on MMLU (70.0 vs. 54.75 and 65.6, respectively) and other benchmarks.
  • Trained on 5.7T tokens, achieving state-of-the-art performance for its scale.
  • Compatible with inference engines such as vLLM and SGLang (see the sketch below).
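
As a quick illustration of the vLLM path, the sketch below uses vLLM's offline LLM API with the base checkpoint named above. It assumes the engine accepts the model's custom code via trust_remote_code and that vLLM supports the architecture as the README claims; engine, memory, and parallelism settings are left at defaults.

    from vllm import LLM, SamplingParams

    # Offline-inference sketch with vLLM (not the repository's own serving setup).
    llm = LLM(model="moonshotai/Moonlight-16B-A3B", trust_remote_code=True)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(["1+1=2, 1+2="], params)
    print(outputs[0].outputs[0].text)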

Maintenance & Community

The project is associated with MoonshotAI. Intermediate checkpoints are planned for release. Citation details for the accompanying paper are provided.

Licensing & Compatibility

The README does not explicitly state a license. The model weights are distributed via Hugging Face and are governed by whatever license is defined on the model card; check that license before any commercial use.

Limitations & Caveats

The README mentions that intermediate checkpoints are "coming soon," indicating they are not yet available. The specific license for the model weights and code is not clearly stated in the provided text, which could impact commercial adoption.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 39 stars in the last 30 days

Explore Similar Projects

Starred by Yaowei Zheng (author of LLaMA-Factory), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 1 more.

VeOmni by ByteDance-Seed

3.4%
1k
Framework for scaling multimodal model training across accelerators
Created 5 months ago
Updated 3 weeks ago