OpenMoE by XueFuzhao

Open-source MoE LLM for research

created 2 years ago
1,568 stars

Top 27.2% on sourcepulse

Project Summary

OpenMoE provides a family of open-source Mixture-of-Experts (MoE) Large Language Models, aiming to foster community research in this promising area. The project fully shares its training data, strategies, architecture, and weights, targeting researchers and developers interested in MoE LLMs.

How It Works

OpenMoE models use a decoder-only architecture, a departure from the encoder-decoder ST-MoE. They are trained initially with a modified UL2 objective and transition to standard next-token prediction in later stages. Key components include rotary position embeddings (RoPE), SwiGLU activations, and a 2K context length. The project emphasizes releasing intermediate checkpoints to facilitate the study of MoE training dynamics.
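For intuition, the sketch below shows the general pattern of a SwiGLU feed-forward expert combined with top-k token routing, as used in MoE layers of this kind. It is a minimal PyTorch illustration, not OpenMoE's actual implementation; the dimensions, number of experts, and top_k value are illustrative assumptions.

```python
# Minimal sketch of a SwiGLU expert and top-k token routing (illustrative only;
# NOT OpenMoE's implementation). Real MoE layers add load-balancing losses,
# capacity limits, and expert parallelism on top of this pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Feed-forward expert with a SwiGLU (SiLU-gated linear unit) activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TopKMoELayer(nn.Module):
    """Route each token to its top-k experts and mix their outputs by router weight."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- flatten batch/sequence dimensions before calling.
        scores = self.router(x).softmax(dim=-1)             # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (indices == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```

For example, TopKMoELayer(d_model=768, d_ff=2048, num_experts=8) applied to a (tokens, 768) tensor returns a tensor of the same shape, with each token's output a weighted mix of its two highest-scoring experts.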

Quick Start & Requirements

  • Inference (PyTorch): Requires a forked version of ColossalAI and transformers. Install via pip install ./ColossalAI and pip install -r ./ColossalAI/examples/language/openmoe/requirements.txt. A minimal inference example using the transformers library is sketched after this list.
  • Colab Demo: Available for JAX checkpoint conversion and PyTorch inference (requires Colab Pro).
  • Memory: OpenMoE-8B requires ~23GB (bfloat16) or ~49GB (float32). OpenMoE-34B requires ~89GB (bfloat16) or ~180GB (float32).
  • Training: Supported on TPUs (via TPU Research Cloud) and GPUs (via ColossalAI implementation).
  • Evaluation: Supports MT-Bench (GPU) and BIG-bench-Lite (TPU).
  • Links: [Homepage] | [Paper] | [Colab Demo] | [Huggingface] | [Discord] | [Twitter] | [Blog]
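
As referenced above, here is a minimal inference sketch using the transformers library. The Hugging Face repo id is an assumption (check the project's Hugging Face page or README for the exact name), and trust_remote_code is included in case the model ships custom modeling code.

```python
# Minimal inference sketch with transformers, in bfloat16 to fit the ~23GB footprint
# noted above. The repo id is an assumption; substitute the id listed by the project.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionZheng/openmoe-8b-chat"  # assumed repo id; verify against the README

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~23GB for OpenMoE-8B; float32 needs ~49GB
    device_map="auto",
    trust_remote_code=True,       # may be required if the repo ships custom code
)

prompt = "Question: What is a Mixture-of-Experts model?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```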

Highlighted Details

  • Offers models up to 34B parameters, trained on up to 1.1T tokens.
  • OpenMoE-8B-Chat outperforms dense LLMs that use roughly twice its FLOPs on MT-Bench first-turn results.
  • Provides intermediate checkpoints (200B to 1T tokens) for studying MoE training dynamics; see the sketch after this list.
  • The training data mixture includes a high proportion of coding data, which is adjusted in later training stages.
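
As noted in the checkpoint bullet above, one way to use the intermediate checkpoints is to load several of them and compare simple statistics across training. The repo ids below are hypothetical placeholders, not confirmed names; substitute the checkpoint ids published by the project.

```python
# Hedged sketch for comparing intermediate checkpoints. Checkpoint ids are
# placeholders; use the ones listed on the project's Hugging Face page.
import torch
from transformers import AutoModelForCausalLM

checkpoints = [
    "OrionZheng/openmoe-8b-200B",  # placeholder id for the 200B-token checkpoint
    "OrionZheng/openmoe-8b-1T",    # placeholder id for the 1T-token checkpoint
]

for ckpt in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(
        ckpt, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    # Example probe: count parameters, or hook router outputs during a forward
    # pass to track how expert load evolves over training.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{ckpt}: {n_params / 1e9:.2f}B parameters")
```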

Maintenance & Community

The project is driven by a student team, with active development noted in recent news. Links to Discord and Twitter are provided for community engagement.

Licensing & Compatibility

Code is licensed under Apache 2.0. Model usage is subject to the licenses of the RedPajama and The Stack datasets.

Limitations & Caveats

The README notes potential convergence issues with the current GPU training implementation (referencing GitHub issues #5163, #5212) and states that the OpenMoE-base model is for debugging only and not suitable for practical applications.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 56 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

  • Tiny pretraining project for a 1.1B Llama model
  • Top 0.3%, 9k stars
  • Created 1 year ago, updated 1 year ago