miles by radixark

Enterprise RL for large-scale MoE models

Created 1 month ago
327 stars

Top 83.3% on SourcePulse

Project Summary

Miles is a reinforcement learning framework for large-scale MoE post-training and production workloads, addressing the need for stable, controllable RL on new hardware and in production environments. Forked from and co-evolving with slime, it targets enterprise users and researchers who need repeatable, auditable experimentation with large models in high-stakes settings.

How It Works

Miles inherits slime's modular, decoupled architecture, separating training (Megatron), rollout/sample generation (SGLang + router), and data management (Data Buffer). This design lets the training and rollout engines scale and be customized independently, so algorithms can be swapped without touching core code. For performance and for numerical alignment between training and inference, it leverages techniques such as FlashAttention-3, DeepGEMM, batch-invariant kernels, and torch.compile.
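
As a rough illustration of this decoupling, the three components can be viewed as independent pieces that only meet at a sample buffer and a weight-sync step. The sketch below uses hypothetical stand-in classes, not Miles's actual interfaces:

```python
import collections
import random

class DataBuffer:
    """Mediates between rollout and training; either side can be swapped."""
    def __init__(self, capacity=4096):
        self.samples = collections.deque(maxlen=capacity)

    def put(self, samples):
        self.samples.extend(samples)

    def get_batch(self, n):
        return [self.samples.popleft() for _ in range(min(n, len(self.samples)))]

class RolloutEngine:
    """Stand-in for SGLang + router: generates samples from current weights."""
    def __init__(self, weights):
        self.weights = weights

    def generate(self, n):
        # Placeholder sampling; real rollouts produce scored model completions.
        return [{"prompt": f"p{i}", "reward": random.random()} for i in range(n)]

    def sync_weights(self, weights):
        self.weights = weights  # in practice: a weight broadcast or reload

class Trainer:
    """Stand-in for the Megatron training engine."""
    def __init__(self):
        self.version = 0  # placeholder for the model parameters

    def train_step(self, batch):
        self.version += 1  # placeholder for an RL update on the batch
        return self.version

buffer, trainer = DataBuffer(), Trainer()
rollout = RolloutEngine(trainer.version)
for _ in range(3):
    buffer.put(rollout.generate(8))          # rollout fills the buffer
    trainer.train_step(buffer.get_batch(8))  # trainer consumes from it
    rollout.sync_weights(trainer.version)    # keep both engines aligned
```

Because the engines only interact through the buffer and the weight sync, either side can be replaced (a new hardware backend, a different rollout engine) without touching the other.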

Quick Start & Requirements

Miles is under active development, and its commands and examples are subject to change. A Quick Start Guide and a set of examples cover environment setup, data preparation, and training startup. Support for specific hardware, such as the GB300, is highlighted. A pre-commit hook is mentioned, implying a standard Python development environment. Official documentation links are still pending.

Highlighted Details

  • True On-Policy: Provides infrastructure-level support for on-policy RL with SGLang + FSDP, minimizing training-inference mismatch for better repeatability and auditability (see the first sketch after this list).
  • Memory Robustness & Efficiency: Implements graceful OOM handling, memory margins, FSDP memory fixes, and offloading strategies to maximize GPU utilization and stability for large MoE jobs (second sketch below).
  • Speculative Training: Performs online SFT on a draft model during RL training, keeping the draft aligned with the moving policy and yielding a rollout speedup of over 25% (third sketch below).
  • Hardware & Examples: Offers GB300 training support and includes a formal mathematics (Lean) example showcasing SFT/RL scripts in a verifiable setting.
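
The practical meaning of "true on-policy" is that the training and inference engines assign identical per-token probabilities to the sampled tokens, so importance ratios are exactly 1 and no off-policy correction is needed. A minimal sketch of how such a mismatch is typically measured, using hypothetical log-prob tensors rather than Miles internals:

```python
import torch

# Hypothetical per-token log-probs for the same sampled tokens, computed once
# by the inference engine (e.g. SGLang) and once by the training engine (FSDP).
logp_infer = torch.tensor([-1.20, -0.35, -2.10])
logp_train = torch.tensor([-1.20, -0.35, -2.10])

# Importance ratio per token: exactly 1.0 everywhere means truly on-policy,
# which is what makes runs repeatable and auditable at the numeric level.
ratio = torch.exp(logp_train - logp_infer)
assert torch.allclose(ratio, torch.ones_like(ratio)), "train/infer mismatch"
```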
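Graceful OOM handling generally means catching the CUDA out-of-memory error, freeing cached memory, and retrying on a smaller micro-batch instead of killing a long-running job. A hedged sketch of that pattern, not Miles's actual recovery logic:

```python
import torch

def train_step_with_retry(step_fn, batch, max_retries=3):
    """Retry step_fn on progressively smaller micro-batches after a CUDA OOM.
    (In this simplified sketch, the dropped half of the batch is discarded.)"""
    for attempt in range(max_retries + 1):
        try:
            return step_fn(batch)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached allocations before retrying
            if attempt == max_retries or len(batch) <= 1:
                raise  # give up and surface the OOM to the job scheduler
            batch = batch[: len(batch) // 2]  # halve the micro-batch and retry
```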
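Speculative training keeps the small draft model used for speculative decoding trained online on the policy's own fresh outputs, so its acceptance rate does not decay as the policy moves. A minimal sketch with a hypothetical toy draft model, not Miles's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny draft model; in practice this is the small model paired
# with the large MoE policy for speculative decoding.
vocab, dim = 100, 32
draft = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.AdamW(draft.parameters(), lr=1e-4)

def online_draft_sft(rollout_tokens):
    """One online-SFT step on tokens the policy just generated, so the draft
    keeps predicting the current policy's outputs."""
    inputs, targets = rollout_tokens[:, :-1], rollout_tokens[:, 1:]
    logits = draft(inputs)  # (batch, seq-1, vocab) next-token predictions
    loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage: after each rollout batch, before the next round of generation.
online_draft_sft(torch.randint(0, vocab, (4, 16)))
```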

Maintenance & Community

Contributions are welcomed, particularly for new hardware backends, MoE RL recipes, stability improvements, and multimodal/speculative training use cases. Links to the slime GitHub repository are provided. Specific community channels or maintainer details are not detailed in the README.

Licensing & Compatibility

The provided README does not specify a software license. Compatibility for commercial use or closed-source linking cannot be determined without this information.

Limitations & Caveats

The project is explicitly noted as being "under active development," with potential for evolving commands and examples. Comprehensive documentation for FAQs and developer guides is indicated as "coming soon."

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 144
  • Issues (30d): 8
  • Star History: 327 stars in the last 30 days
