OpenMoE by XueFuzhao

Open-source MoE LLM for research

created 2 years ago
1,568 stars

Top 27.2% on sourcepulse

Project Summary

OpenMoE provides a family of open-source Mixture-of-Experts (MoE) Large Language Models, aiming to foster community research in this promising area. The project fully shares its training data, strategies, architecture, and weights, targeting researchers and developers interested in MoE LLMs.

How It Works

OpenMoE models use a decoder-only architecture, a departure from the encoder-decoder ST-MoE. They are trained initially with a modified UL2 objective and transition to standard next-token prediction in later stages. Key components include rotary position embeddings (RoPE), SwiGLU activations, and a 2K context length. The project emphasizes releasing intermediate checkpoints to facilitate the study of MoE training dynamics.
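For intuition, the sketch below shows the general pattern of a SwiGLU feed-forward expert combined with top-k token routing, as used in MoE layers of this kind. It is a minimal PyTorch illustration, not OpenMoE's actual implementation; the dimensions, number of experts, and top_k value are illustrative assumptions.

```python
# Minimal sketch of a SwiGLU expert and top-k token routing (illustrative only;
# NOT OpenMoE's implementation). Real MoE layers add load-balancing losses,
# capacity limits, and expert parallelism on top of this pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Feed-forward expert with a SwiGLU (SiLU-gated linear unit) activation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TopKMoELayer(nn.Module):
    """Route each token to its top-k experts and mix their outputs by router weight."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- flatten batch/sequence dimensions before calling.
        scores = self.router(x).softmax(dim=-1)             # (tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (indices == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```

For example, TopKMoELayer(d_model=768, d_ff=2048, num_experts=8) applied to a (tokens, 768) tensor returns a tensor of the same shape, with each token's output a weighted mix of its two highest-scoring experts.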

Quick Start & Requirements

  • Inference (PyTorch): Requires a forked version of ColossalAI and transformers. Install via pip install ./ColossalAI and pip install -r ./ColossalAI/examples/language/openmoe/requirements.txt. A minimal inference example using the transformers library is sketched after this list.
  • Colab Demo: Available for JAX checkpoint conversion and PyTorch inference (requires Colab Pro).
  • Memory: OpenMoE-8B requires ~23GB (bfloat16) or ~49GB (float32). OpenMoE-34B requires ~89GB (bfloat16) or ~180GB (float32).
  • Training: Supported on TPUs (via TPU Research Cloud) and GPUs (via ColossalAI implementation).
  • Evaluation: Supports MT-Bench (GPU) and BIG-bench-Lite (TPU).
  • Links: [Homepage] | [Paper] | [Colab Demo] | [Huggingface] | [Discord] | [Twitter] | [Blog]
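
As referenced above, here is a minimal inference sketch using the transformers library. The Hugging Face repo id is an assumption (check the project's Hugging Face page or README for the exact name), and trust_remote_code is included in case the model ships custom modeling code.

```python
# Minimal inference sketch with transformers, in bfloat16 to fit the ~23GB footprint
# noted above. The repo id is an assumption; substitute the id listed by the project.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionZheng/openmoe-8b-chat"  # assumed repo id; verify against the README

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~23GB for OpenMoE-8B; float32 needs ~49GB
    device_map="auto",
    trust_remote_code=True,       # may be required if the repo ships custom code
)

prompt = "Question: What is a Mixture-of-Experts model?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```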

Highlighted Details

  • Offers models up to 34B parameters, trained on up to 1.1T tokens.
  • OpenMoE-8B-Chat outperforms dense LLMs that use roughly twice its FLOPs on MT-Bench first-turn results.
  • Provides intermediate checkpoints (200B to 1T tokens) for studying MoE training dynamics; see the sketch after this list.
  • The training data mixture includes a high proportion of coding data, which is adjusted in later training stages.
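
As noted in the checkpoint bullet above, one way to use the intermediate checkpoints is to load several of them and compare simple statistics across training. The repo ids below are hypothetical placeholders, not confirmed names; substitute the checkpoint ids published by the project.

```python
# Hedged sketch for comparing intermediate checkpoints. Checkpoint ids are
# placeholders; use the ones listed on the project's Hugging Face page.
import torch
from transformers import AutoModelForCausalLM

checkpoints = [
    "OrionZheng/openmoe-8b-200B",  # placeholder id for the 200B-token checkpoint
    "OrionZheng/openmoe-8b-1T",    # placeholder id for the 1T-token checkpoint
]

for ckpt in checkpoints:
    model = AutoModelForCausalLM.from_pretrained(
        ckpt, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )
    # Example probe: count parameters, or hook router outputs during a forward
    # pass to track how expert load evolves over training.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{ckpt}: {n_params / 1e9:.2f}B parameters")
```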

Maintenance & Community

The project is driven by a student team, with active development noted in recent news. Links to Discord and Twitter are provided for community engagement.

Licensing & Compatibility

Code is licensed under Apache 2.0. Model usage is subject to the licenses of the RedPajama and The Stack datasets.

Limitations & Caveats

The README notes potential convergence issues with the current GPU training implementation (referencing GitHub issues #5163, #5212) and states that the OpenMoE-base model is for debugging only and not suitable for practical applications.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star history: 56 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

  • Tiny pretraining project for a 1.1B Llama model
  • Top 0.3%, 9k stars
  • Created 1 year ago, updated 1 year ago