nmoe by Noumena-Network

High-performance MoE trainer for NVIDIA B200 GPUs

Created 4 months ago
382 stars

Top 74.5% on SourcePulse

View on GitHub
Project Summary

Noumena-Network/nmoe provides an opinionated Mixture-of-Experts (MoE) trainer specifically engineered for NVIDIA Blackwell B200 GPUs. It addresses the performance bottlenecks in traditional MoE training by implementing a novel expert parallelism strategy, RDEP, enabling efficient large-scale model training for researchers and power users.

How It Works

The core innovation is RDEP (Remote Direct Memory Access Event-driven Parallelism), which replaces the standard NCCL all-to-all collectives used for expert communication. Instead of globally synchronizing all ranks, RDEP dispatches tokens directly to the ranks that own the target experts, using NVSHMEM for inter-node and CUDA IPC for intra-node transfers. This one-sided, put-based approach removes collective barriers and the idle time they impose, improving communication efficiency and throughput in MoE layers.
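The real kernels do one-sided NVSHMEM/CUDA IPC writes on-device, but the routing idea can be sketched in plain Python. Everything below (function names, the contiguous expert-to-rank sharding) is illustrative, not nmoe's actual API: each sender appends its tokens straight into the owning rank's receive buffer, with no global collective step in between.

```python
# Illustrative sketch of put-based expert dispatch (not nmoe's real API).
# Each rank "puts" tokens directly into the owning rank's receive buffer,
# analogous to one-sided NVSHMEM/IPC writes; no all-to-all barrier is needed.

NUM_RANKS = 4
EXPERTS_PER_RANK = 2  # assumption: experts sharded contiguously across ranks

def expert_owner(expert_id: int) -> int:
    # Which rank hosts this expert under contiguous sharding.
    return expert_id // EXPERTS_PER_RANK

def dispatch(tokens_by_rank):
    # recv_buffers[owner][expert] collects tokens pushed by any sender.
    recv_buffers = {r: {} for r in range(NUM_RANKS)}
    for sender, routed in tokens_by_rank.items():
        for token, expert in routed:
            owner = expert_owner(expert)
            recv_buffers[owner].setdefault(expert, []).append((sender, token))
    return recv_buffers

# Two ranks route tokens; expert 5 lives on rank 2, so both t1 and t2
# land in rank 2's buffer without any collective exchange.
routed = {
    0: [("t0", 0), ("t1", 5)],
    1: [("t2", 5), ("t3", 2)],
}
buffers = dispatch(routed)
print(buffers[2])  # {5: [(0, 't1'), (1, 't2')]}
```

The single-process dict stands in for symmetric-heap receive buffers; the point is only that each write targets one destination rank, so senders never wait on a global barrier.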

Quick Start & Requirements

This repository is container-first, requiring Docker. The primary prerequisite is NVIDIA Blackwell B200 GPUs (sm_100a).

  • Install: Build Docker images using docker/Dockerfile.base and docker/Dockerfile.train.
  • Single-GPU: docker run --gpus all -v /data:/data xjdr/nmoe_train:latest python -m nmoe.train configs/moonlet.toml
  • Multi-GPU (8x): torchrun --standalone --nproc_per_node=8 -m nmoe.train configs/moonlight.toml
  • Multi-node: Build xjdr/nmoe_dist:latest (requires NVSHMEM) and use k8s manifests.
  • Data: Requires pre-tokenized .npy shards; preprocessing example: python -m nmoe.data.cli prep --source hf --dataset HuggingFaceFW/fineweb-edu --output /data/fineweb_edu --name fineweb_edu.
  • Docs: nmoe/data/README.md, nviz/README.md.
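The exact shard layout is documented in nmoe/data/README.md. As a rough illustration only (the file name and dtype here are assumptions, not nmoe's spec), a pre-tokenized .npy shard is just a flat array of token ids that the trainer can memory-map rather than load whole:

```python
import os
import tempfile
import numpy as np

# Hypothetical shard: a flat uint32 array of token ids. The real layout
# is defined in nmoe/data/README.md; names and dtype here are assumptions.
tokens = np.arange(1024, dtype=np.uint32)
shard_path = os.path.join(tempfile.gettempdir(), "shard_00000.npy")
np.save(shard_path, tokens)

# At training time a shard can be memory-mapped instead of fully loaded,
# so only the slices actually read are paged in.
shard = np.load(shard_path, mmap_mode="r")
seq_len = 128
batch = shard[:seq_len]          # one sequence worth of tokens
print(batch.shape, batch.dtype)  # (128,) uint32
```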

Highlighted Details

  • RDEP Kernels: Fused dispatch/return via NVSHMEM/IPC.
  • Optimized Paths: BF16 and blockscaled (FP8/NVFP4) support.
  • Grouped GEMMs: SM100-optimized via CuTe DSL, leveraging cuBLASLt.
  • Deterministic Resume: Checkpoints capture RNG state, shard cursor, and config fingerprint.
  • HYDRA: Integrated LLM-as-judge data quality pipeline.
  • NVIZ: Included dashboard for visualizing training metrics.
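The deterministic-resume bullet above can be sketched as follows. The field names and fingerprint scheme are illustrative assumptions, not nmoe's checkpoint format: saving the RNG state, the data-shard cursor, and a hash of the config lets a restart reject a drifted config and replay the exact data order.

```python
import hashlib
import json
import random

def config_fingerprint(config: dict) -> str:
    # Stable hash of the config so a resume can detect config drift.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def save_checkpoint(step: int, shard_cursor: int, config: dict) -> dict:
    return {
        "step": step,
        "shard_cursor": shard_cursor,    # position in the shard stream
        "rng_state": random.getstate(),  # RNG state for exact replay
        "config_fingerprint": config_fingerprint(config),
    }

def resume(ckpt: dict, config: dict) -> int:
    if ckpt["config_fingerprint"] != config_fingerprint(config):
        raise ValueError("config changed since checkpoint; refusing to resume")
    random.setstate(ckpt["rng_state"])   # rewind RNG to the checkpointed state
    return ckpt["shard_cursor"]

config = {"model": "moonlet", "lr": 3e-4}
random.seed(0)
ckpt = save_checkpoint(step=100, shard_cursor=42, config=config)
next_draw = random.random()          # what training would have sampled next
assert resume(ckpt, config) == 42    # restart: shard cursor restored...
assert random.random() == next_draw  # ...and the RNG stream replays exactly
```

A real trainer would also checkpoint model and optimizer state; the sketch only covers the three determinism fields named in the bullet.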

Maintenance & Community

The project adopts a narrow, opinionated stance, focusing on specific hardware and parallelism strategies. No explicit community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Strictly limited to NVIDIA Blackwell B200 (sm_100a) hardware; no support for H100/A100 or fallback paths. Tensor parallelism is not implemented, and NCCL all-to-all is explicitly excluded from the MoE communication path.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

Top 0.2% on SourcePulse
10k stars
PyTorch training helper for distributed execution
Created 5 years ago
Updated 22 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

Top 0.3% on SourcePulse
30k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 2 years ago
Updated 10 months ago