nmoe by Noumena-Network

High-performance MoE trainer for NVIDIA B200 GPUs

Created 4 months ago
382 stars

Top 74.5% on SourcePulse

View on GitHub
Project Summary

Noumena-Network/nmoe provides an opinionated Mixture-of-Experts (MoE) trainer specifically engineered for NVIDIA Blackwell B200 GPUs. It addresses the performance bottlenecks in traditional MoE training by implementing a novel expert parallelism strategy, RDEP, enabling efficient large-scale model training for researchers and power users.

How It Works

The core innovation is RDEP (Remote Direct Memory Access Event-driven Parallelism), which replaces the standard NCCL all-to-all collectives used for expert communication. Instead of globally synchronizing all ranks, RDEP dispatches tokens directly to the ranks that own the target experts, using NVSHMEM for inter-node and CUDA IPC for intra-node transfers. This one-sided, put-based approach removes collective barriers and the idle time they impose, improving communication efficiency and throughput in MoE layers.
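The real kernels do one-sided NVSHMEM/CUDA IPC writes on-device, but the routing idea can be sketched in plain Python. Everything below (function names, the contiguous expert-to-rank sharding) is illustrative, not nmoe's actual API: each sender appends its tokens straight into the owning rank's receive buffer, with no global collective step in between.

```python
# Illustrative sketch of put-based expert dispatch (not nmoe's real API).
# Each rank "puts" tokens directly into the owning rank's receive buffer,
# analogous to one-sided NVSHMEM/IPC writes; no all-to-all barrier is needed.

NUM_RANKS = 4
EXPERTS_PER_RANK = 2  # assumption: experts sharded contiguously across ranks

def expert_owner(expert_id: int) -> int:
    # Which rank hosts this expert under contiguous sharding.
    return expert_id // EXPERTS_PER_RANK

def dispatch(tokens_by_rank):
    # recv_buffers[owner][expert] collects tokens pushed by any sender.
    recv_buffers = {r: {} for r in range(NUM_RANKS)}
    for sender, routed in tokens_by_rank.items():
        for token, expert in routed:
            owner = expert_owner(expert)
            recv_buffers[owner].setdefault(expert, []).append((sender, token))
    return recv_buffers

# Two ranks route tokens; expert 5 lives on rank 2, so both t1 and t2
# land in rank 2's buffer without any collective exchange.
routed = {
    0: [("t0", 0), ("t1", 5)],
    1: [("t2", 5), ("t3", 2)],
}
buffers = dispatch(routed)
print(buffers[2])  # {5: [(0, 't1'), (1, 't2')]}
```

The single-process dict stands in for symmetric-heap receive buffers; the point is only that each write targets one destination rank, so senders never wait on a global barrier.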

Quick Start & Requirements

This repository is container-first, requiring Docker. The primary prerequisite is NVIDIA Blackwell B200 GPUs (sm_100a).

  • Install: Build Docker images using docker/Dockerfile.base and docker/Dockerfile.train.
  • Single-GPU: docker run --gpus all -v /data:/data xjdr/nmoe_train:latest python -m nmoe.train configs/moonlet.toml
  • Multi-GPU (8x): torchrun --standalone --nproc_per_node=8 -m nmoe.train configs/moonlight.toml
  • Multi-node: Build xjdr/nmoe_dist:latest (requires NVSHMEM) and use k8s manifests.
  • Data: Requires pre-tokenized .npy shards; preprocessing example: python -m nmoe.data.cli prep --source hf --dataset HuggingFaceFW/fineweb-edu --output /data/fineweb_edu --name fineweb_edu.
  • Docs: nmoe/data/README.md, nviz/README.md.
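The exact shard layout is documented in nmoe/data/README.md. As a rough illustration only (the file name and dtype here are assumptions, not nmoe's spec), a pre-tokenized .npy shard is just a flat array of token ids that the trainer can memory-map rather than load whole:

```python
import os
import tempfile
import numpy as np

# Hypothetical shard: a flat uint32 array of token ids. The real layout
# is defined in nmoe/data/README.md; names and dtype here are assumptions.
tokens = np.arange(1024, dtype=np.uint32)
shard_path = os.path.join(tempfile.gettempdir(), "shard_00000.npy")
np.save(shard_path, tokens)

# At training time a shard can be memory-mapped instead of fully loaded,
# so only the slices actually read are paged in.
shard = np.load(shard_path, mmap_mode="r")
seq_len = 128
batch = shard[:seq_len]          # one sequence worth of tokens
print(batch.shape, batch.dtype)  # (128,) uint32
```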

Highlighted Details

  • RDEP Kernels: Fused dispatch/return via NVSHMEM/IPC.
  • Optimized Paths: BF16 and blockscaled (FP8/NVFP4) support.
  • Grouped GEMMs: SM100-optimized via CuTe DSL, leveraging cuBLASLt.
  • Deterministic Resume: Checkpoints capture RNG state, shard cursor, and config fingerprint.
  • HYDRA: Integrated LLM-as-judge data quality pipeline.
  • NVIZ: Included dashboard for visualizing training metrics.
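The deterministic-resume bullet above can be sketched as follows. The field names and fingerprint scheme are illustrative assumptions, not nmoe's checkpoint format: saving the RNG state, the data-shard cursor, and a hash of the config lets a restart reject a drifted config and replay the exact data order.

```python
import hashlib
import json
import random

def config_fingerprint(config: dict) -> str:
    # Stable hash of the config so a resume can detect config drift.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def save_checkpoint(step: int, shard_cursor: int, config: dict) -> dict:
    return {
        "step": step,
        "shard_cursor": shard_cursor,    # position in the shard stream
        "rng_state": random.getstate(),  # RNG state for exact replay
        "config_fingerprint": config_fingerprint(config),
    }

def resume(ckpt: dict, config: dict) -> int:
    if ckpt["config_fingerprint"] != config_fingerprint(config):
        raise ValueError("config changed since checkpoint; refusing to resume")
    random.setstate(ckpt["rng_state"])   # rewind RNG to the checkpointed state
    return ckpt["shard_cursor"]

config = {"model": "moonlet", "lr": 3e-4}
random.seed(0)
ckpt = save_checkpoint(step=100, shard_cursor=42, config=config)
next_draw = random.random()          # what training would have sampled next
assert resume(ckpt, config) == 42    # restart: shard cursor restored...
assert random.random() == next_draw  # ...and the RNG stream replays exactly
```

A real trainer would also checkpoint model and optimizer state; the sketch only covers the three determinism fields named in the bullet.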

Maintenance & Community

The project adopts a narrow, opinionated stance, focusing on specific hardware and parallelism strategies. No explicit community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Strictly limited to NVIDIA Blackwell B200 (sm_100a) hardware; no support for H100/A100 or fallback paths. Tensor parallelism is not implemented, and NCCL all-to-all is explicitly excluded from the MoE communication path.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

Top 0.2% on SourcePulse
10k stars
PyTorch training helper for distributed execution
Created 5 years ago
Updated 22 hours ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

Top 0.3% on SourcePulse
30k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 2 years ago
Updated 10 months ago