GRIN-MoE by microsoft

MoE for code and math, using gradient-informed routing

Created 1 year ago
264 stars

Top 96.8% on SourcePulse

Project Summary

GRIN-MoE is a 6.6B active parameter Mixture-of-Experts (MoE) language model designed for memory/compute-constrained and latency-bound environments, excelling in coding and mathematics tasks. It targets researchers and developers building generative AI applications requiring strong reasoning capabilities.

How It Works

GRIN-MoE employs SparseMixer-v2 for gradient-informed expert routing, a departure from conventional MoE training, which uses the gating probability as a differentiable proxy for the discrete expert choice. This approach enables efficient scaling without expert parallelism or token dropping, yielding improved performance with fewer active parameters.
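
The exact SparseMixer-v2 estimator is described in the GRIN paper; the sketch below is only a minimal illustration of the contrast between the two training signals, using a generic straight-through-style stand-in rather than SparseMixer-v2 itself. The Top1Router class, its dimensions, and the gradient_informed flag are illustrative assumptions, not the GRIN-MoE implementation.

    # Minimal sketch (not the GRIN-MoE code): contrasts conventional
    # "gating as a proxy" top-1 routing with a straight-through-style
    # estimator that lets gradients reach the discrete routing decision.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top1Router(nn.Module):
        def __init__(self, d_model: int, n_experts: int):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts, bias=False)

        def forward(self, x: torch.Tensor, gradient_informed: bool = True):
            probs = F.softmax(self.gate(x), dim=-1)                # (tokens, n_experts)
            top1 = probs.argmax(dim=-1)                            # discrete expert choice
            hard = F.one_hot(top1, probs.size(-1)).type_as(probs)  # one-hot routing mask
            if gradient_informed:
                # Forward pass uses the hard one-hot mask; backward pass routes
                # gradients through the soft probabilities, so the router is
                # trained on the routing decision rather than a proxy.
                return hard + probs - probs.detach(), top1
            # Conventional proxy: scale the chosen expert's output by its gate
            # probability; the argmax itself receives no gradient.
            return hard * probs, top1

A caller would multiply each expert's output by the corresponding column of the returned mask. Per the GRIN paper, SparseMixer-v2 replaces the naive straight-through step shown here with a more accurate gradient estimator for the routing decision.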

Quick Start & Requirements

  • Inference Demo: curl https://raw.githubusercontent.com/microsoft/GRIN-MoE/main/demo/demo.sh | bash -s (requires Docker)
  • Interactive Demo: Launch a Jupyter notebook via Docker: docker run --gpus all -p 8887:8887 --rm nvcr.io/nvidia/pytorch:24.08-py3 /bin/bash -c 'git clone https://github.com/microsoft/GRIN-MoE.git && jupyter notebook --port 8887 --notebook-dir GRIN-MoE/demo'
  • Prerequisites: Docker, NVIDIA GPUs (for demo scripts).
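
If the checkpoint is published on Hugging Face (the identifier microsoft/GRIN-MoE is assumed here, not confirmed by this page), a typical transformers loading pattern looks like the sketch below; check the repository's model card for the exact instructions, dtype, and hardware requirements.

    # Hedged sketch: load the model with Hugging Face transformers.
    # The checkpoint id "microsoft/GRIN-MoE" and trust_remote_code=True
    # are assumptions; verify against the official model card before use.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/GRIN-MoE"  # assumed identifier
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # MoE weights are large; use a GPU-friendly dtype
        device_map="auto",            # spread layers across available GPUs
        trust_remote_code=True,
    )

    prompt = "Write a Python function that checks whether a number is prime."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))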

Highlighted Details

  • Achieves an average score of 79.6 across popular benchmarks, outperforming Mixtral 8x7B and Llama 3 8B.
  • Demonstrates strong performance in coding (HumanEval: 74.4, MBPP: 80.3) and mathematics (GSM-8K: 90.4).
  • Trained on 4.0T tokens, including high-quality educational and synthetic data.
  • Context length is 4K tokens.

Maintenance & Community

The repository shows little recent activity: the last commit was 11 months ago, with no pull requests or issues opened in the last 30 days (see the Health Check below).

Licensing & Compatibility

  • Licensed under the MIT license, permitting commercial use and modification.

Limitations & Caveats

The model is primarily trained on English and may exhibit reduced performance on other languages or English dialects with less representation. It can perpetuate societal biases and generate inaccurate or offensive content, requiring careful evaluation and mitigation for sensitive applications. Code generation is primarily focused on Python.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
