nmoe by Noumena-Network

High-performance MoE trainer for NVIDIA B200 GPUs

Created 3 weeks ago

316 stars

Top 85.6% on SourcePulse

Project Summary

Noumena-Network/nmoe provides an opinionated Mixture-of-Experts (MoE) trainer specifically engineered for NVIDIA Blackwell B200 GPUs. It addresses the performance bottlenecks in traditional MoE training by implementing a novel expert parallelism strategy, RDEP, enabling efficient large-scale model training for researchers and power users.

How It Works

The core innovation is RDEP (Remote Direct Memory Access Event-driven Parallelism), which replaces the standard NCCL all-to-all collectives normally used for expert communication. Instead of synchronizing globally, RDEP dispatches each token directly to the rank that owns its expert, using NVSHMEM for inter-node and CUDA IPC for intra-node transfers. This direct, put-based approach removes collective barriers and the idle time they cause, significantly improving communication efficiency and throughput for MoE layers.
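
For context, the pattern RDEP replaces looks roughly like the sketch below: a conventional PyTorch MoE dispatch in which every rank blocks on NCCL all-to-all collectives to exchange token counts and then token payloads. All names, shapes, and the contiguous expert-to-rank mapping are illustrative assumptions, not taken from the nmoe source.

    # Illustrative baseline only: the all-to-all dispatch RDEP is designed to avoid.
    # Assumes CUDA tensors and an initialized NCCL process group.
    import torch
    import torch.distributed as dist

    def alltoall_dispatch(tokens, expert_of_token, num_experts, world_size):
        """Exchange tokens so each rank receives the tokens routed to its experts."""
        experts_per_rank = num_experts // world_size
        dest_rank = expert_of_token // experts_per_rank  # contiguous expert placement

        # Sort tokens by destination rank so the send buffer is contiguous per rank.
        order = torch.argsort(dest_rank)
        send_buf = tokens[order]
        send_counts = torch.bincount(dest_rank, minlength=world_size)

        # First collective: every rank learns how many tokens it will receive.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)

        # Second collective: the token payload itself. Every rank waits here,
        # which is the global barrier RDEP's direct puts remove.
        recv_buf = tokens.new_empty((int(recv_counts.sum()), tokens.shape[-1]))
        dist.all_to_all_single(recv_buf, send_buf,
                               output_split_sizes=recv_counts.tolist(),
                               input_split_sizes=send_counts.tolist())
        return recv_buf

In RDEP, both collectives are replaced by one-sided puts into buffers owned by the destination expert, so a rank can proceed as soon as its own tokens have been written.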

Quick Start & Requirements

This repository is container-first, requiring Docker. The primary prerequisite is NVIDIA Blackwell B200 GPUs (sm_100a).

  • Install: Build Docker images using docker/Dockerfile.base and docker/Dockerfile.train.
  • Single-GPU: docker run --gpus all -v /data:/data xjdr/nmoe_train:latest python -m nmoe.train configs/moonlet.toml
  • Multi-GPU (8x): torchrun --standalone --nproc_per_node=8 -m nmoe.train configs/moonlight.toml
  • Multi-node: Build xjdr/nmoe_dist:latest (requires NVSHMEM) and use k8s manifests.
  • Data: Requires pre-tokenized .npy shards (a shard-inspection sketch follows this list); preprocessing example: python -m nmoe.data.cli prep --source hf --dataset HuggingFaceFW/fineweb-edu --output /data/fineweb_edu --name fineweb_edu.
  • Docs: nmoe/data/README.md, nviz/README.md.
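
As a quick sanity check on the expected data format, the sketch below inspects one prepared shard with NumPy. The file path, and the assumption that each shard is a flat integer array of token ids, are illustrative; nmoe/data/README.md is the authoritative reference.

    # Hypothetical sanity check for a pre-tokenized shard; the path and the
    # flat-integer-token-id layout are assumptions, not nmoe's documented format.
    import numpy as np

    shard = np.load("/data/fineweb_edu/shard_00000.npy", mmap_mode="r")
    print(f"dtype={shard.dtype}, shape={shard.shape}")
    print("first tokens:", np.asarray(shard[:16]).tolist())
    assert shard.dtype.kind in ("i", "u"), "expected integer token ids"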

Highlighted Details

  • RDEP Kernels: Fused dispatch/return via NVSHMEM/IPC.
  • Optimized Paths: BF16 and blockscaled (FP8/NVFP4) support.
  • Grouped GEMMs: SM100-optimized via CuTe DSL, leveraging cuBLASLt.
  • Deterministic Resume: Checkpoints capture RNG state, shard cursor, and config fingerprint (a checkpoint sketch follows this list).
  • HYDRA: Integrated LLM-as-judge data quality pipeline.
  • NVIZ: Included dashboard for visualizing training metrics.
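
To make the deterministic-resume bullet concrete, here is a minimal sketch of what such a checkpoint payload could capture using standard PyTorch facilities. The field names, the shard-cursor representation, and the config fingerprint are assumptions rather than nmoe's actual checkpoint schema.

    # Hypothetical deterministic-resume payload; field names and layout are
    # assumptions, not nmoe's checkpoint schema.
    import hashlib
    import torch

    def build_checkpoint(model, optimizer, step, shard_index, offset_in_shard, config_text):
        return {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
            # RNG state so dropout and sampling replay identically after resume.
            "rng": {
                "cpu": torch.get_rng_state(),
                "cuda": torch.cuda.get_rng_state_all(),
            },
            # Data shard cursor so the loader resumes at the exact token position.
            "shard_cursor": {"shard": shard_index, "offset": offset_in_shard},
            # Fingerprint of the TOML config to detect resuming under a changed config.
            "config_fingerprint": hashlib.sha256(config_text.encode()).hexdigest(),
        }

On resume, a trainer following this scheme would restore the RNG states (torch.set_rng_state, torch.cuda.set_rng_state_all), seek the data loader to the saved cursor, and refuse to continue if the config fingerprint has changed.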

Maintenance & Community

The project adopts a narrow, opinionated stance, focusing on specific hardware and parallelism strategies. No explicit community channels (Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: Permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Strictly limited to NVIDIA Blackwell B200 (sm_100a) hardware; no support for H100/A100 or fallback paths. Tensor parallelism is not implemented, and NCCL all-to-all is explicitly excluded from the MoE communication path.
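
Given the hard sm_100a requirement, a pre-flight guard like the one below can fail fast on unsupported hardware. It assumes that B200-class GPUs report compute capability 10.x through PyTorch; the check is illustrative and not part of nmoe.

    # Illustrative pre-flight guard, not part of nmoe: fail fast unless a
    # Blackwell-class GPU (compute capability 10.x, i.e. sm_100) is present.
    import torch

    def assert_blackwell():
        if not torch.cuda.is_available():
            raise SystemExit("CUDA is not available; nmoe targets NVIDIA B200 GPUs.")
        major, minor = torch.cuda.get_device_capability(0)
        if major < 10:
            raise SystemExit(
                f"Found compute capability {major}.{minor}; nmoe's kernels target "
                "sm_100a (B200) and provide no H100/A100 fallback."
            )

    assert_blackwell()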

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 8
  • Issues (30d): 6

Star History

318 stars in the last 25 days.

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

Explore Similar Projects

InternEvo by InternLM

Top 0.2% on SourcePulse · 417 stars
Lightweight training framework for model pre-training
Created 2 years ago · Updated 4 months ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 20 more.

accelerate by huggingface

Top 0.1% on SourcePulse · 9k stars
PyTorch training helper for distributed execution
Created 5 years ago · Updated 2 days ago
Starred by Peter Norvig (Author of "Artificial Intelligence: A Modern Approach"; Research Director at Google), Alexey Milovidov (Cofounder of ClickHouse), and 29 more.

llm.c by karpathy

Top 0.2% on SourcePulse · 29k stars
LLM training in pure C/CUDA, no PyTorch needed
Created 1 year ago · Updated 6 months ago