Moonlight by MoonshotAI

MoE model and optimizer tech report

created 5 months ago
1,235 stars

Top 32.5% on sourcepulse

Project Summary

This repository provides the Moonlight LLM, a 16B parameter Mixture-of-Experts model trained using the Muon optimizer. It aims to advance the Pareto frontier of performance versus training FLOPs, offering improved computational efficiency and strong benchmark results across various tasks, including English, code, math, and Chinese evaluations. The project is suitable for researchers and developers looking for high-performance LLMs with efficient training methodologies.

How It Works

Moonlight leverages the Muon optimizer, enhanced with weight decay and per-parameter update-scale adjustments for improved stability and scalability. The implementation is distributed with ZeRO-1-style optimizations for memory efficiency and reduced communication overhead. This lets Muon match AdamW's performance with approximately 52% of the training FLOPs, a significant computational gain.
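For intuition, here is a minimal single-device sketch of a Muon-style update step with decoupled weight decay. This is a hedged illustration, not the repository's distributed implementation: the function names (newton_schulz, muon_step), the toy usage, and the hyperparameter values are invented for this example, while the Newton-Schulz coefficients and the 0.2 * sqrt(max dim) update scale follow the publicly described Muon recipe.

    import torch

    def newton_schulz(G, steps=5, eps=1e-7):
        # Approximately orthogonalize matrix G with a quintic
        # Newton-Schulz iteration (coefficients from the Muon recipe).
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (G.norm() + eps)
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
        return X.T if transposed else X

    @torch.no_grad()
    def muon_step(param, momentum, lr=2e-2, mu=0.95, wd=0.1):
        # Accumulate momentum, then orthogonalize the (Nesterov-style) update.
        momentum.mul_(mu).add_(param.grad)
        update = newton_schulz(param.grad + mu * momentum)
        # Per-shape scale keeps the update RMS consistent across matrices,
        # roughly matching AdamW's typical update magnitude.
        scale = 0.2 * max(param.size(0), param.size(1)) ** 0.5
        param.mul_(1 - lr * wd)               # decoupled weight decay
        param.add_(update, alpha=-lr * scale)

    # Toy usage: one update step on a random weight matrix.
    W = torch.randn(256, 128, requires_grad=True)
    (W ** 2).sum().backward()
    muon_step(W, torch.zeros_like(W))

In the actual distributed setting, the momentum buffers are sharded ZeRO-1 style across data-parallel ranks, with full gradient matrices gathered before the Newton-Schulz step.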

Quick Start & Requirements

  • Inference: Use Hugging Face Transformers. Recommended environment: python=3.10, torch>=2.1.0, transformers=4.48.2.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model_path = "moonshotai/Moonlight-16B-A3B"
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
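    # Hedged continuation (not quoted from the repo): a minimal generation
    # call using standard Transformers APIs; the prompt text and the
    # max_new_tokens value are illustrative.
    prompt = "1+1=2, 1+2="
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))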
    
  • Training: Example commands provided for training dense models with Muon or AdamW.
  • Dependencies: PyTorch, Transformers.
  • Resources: Requires significant GPU resources for training and inference.
  • Links: Hugging Face Model, Tech Report

Highlighted Details

  • Achieves ~2x the computational efficiency of AdamW under compute-optimal training.
  • Moonlight (16B MoE) outperforms Llama3.2-3B and Qwen2.5-3B on MMLU (70.0 vs 54.75/65.6) and other benchmarks.
  • Trained on 5.7T tokens, achieving state-of-the-art performance for its scale.
  • Compatible with inference engines such as vLLM and SGLang.

Maintenance & Community

The project is associated with MoonshotAI. Intermediate checkpoints are planned for release. Citation details for the accompanying paper are provided.

Licensing & Compatibility

The README does not explicitly state a license. The model weights are available via Hugging Face, typically under a license defined on the model card; commercial use requires checking that model-specific license.

Limitations & Caveats

The README mentions that intermediate checkpoints are "coming soon," indicating they are not yet available. The specific license for the model weights and code is not clearly stated in the provided text, which could impact commercial adoption.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 3

Star History

205 stars in the last 90 days

Explore Similar Projects

fms-fsdp by foundation-model-stack: Efficiently train foundation models with PyTorch. 258 stars; created 1 year ago; updated 1 week ago. Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

TinyLlama by jzhang38: Tiny pretraining project for a 1.1B Llama model. 9k stars; created 1 year ago; updated 1 year ago. Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), George Hotz (Author of tinygrad; founder of the tiny corp and comma.ai), and 10 more.

DeepSeek-Coder-V2 by deepseek-ai: Open-source code language model comparable to GPT4-Turbo. 6k stars; created 1 year ago; updated 10 months ago. Starred by Chip Huyen (Author of AI Engineering and Designing Machine Learning Systems) and Jiayi Pan (Author of SWE-Gym; AI researcher at UC Berkeley).

StableLM by Stability-AI: Language models by Stability AI. 16k stars; created 2 years ago; updated 1 year ago. Starred by George Hotz (Author of tinygrad; founder of the tiny corp and comma.ai), Calvin French-Owen (Cofounder of Segment), and 12 more.