checkpoint-engine by MoonshotAI

Middleware for efficient LLM weight updates during inference

Created 1 week ago

701 stars

Top 48.7% on SourcePulse

View on GitHub
Project Summary

Checkpoint-engine is middleware for efficiently updating LLM weights inside running inference engines, a key requirement in reinforcement learning, where the trainer must repeatedly push fresh weights to inference workers. It targets engineers who need fast, in-place weight updates across distributed GPU setups, offering significant performance gains.

How It Works

The core ParameterServer exposes two update paths: Broadcast (synchronous and high-throughput, for updating all instances at once) and P2P (for dynamically joining instances, built on mooncake-transfer-engine). The Broadcast path moves weights through a three-stage pipeline (H2D copy, inter-worker broadcast, engine reload), overlapping communication with copying, and falls back to serial execution when GPU memory is constrained.
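The overlap described above can be sketched as a toy double-buffered pipeline. This is an illustrative stand-in, not the actual checkpoint-engine API: in the real system the stages would be a CUDA host-to-device copy, an inter-worker (e.g., NCCL) broadcast, and the engine's reload hook, and the function and buffer names below are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def h2d_copy(bucket, buf):
    # Stand-in for stage 1: host-to-device copy into a staging buffer.
    buf.clear()
    buf.extend(bucket)
    return buf

def broadcast_and_reload(buf, out):
    # Stand-in for stages 2+3: inter-worker broadcast, then engine reload.
    out.extend(buf)

def pipelined_update(buckets):
    out, bufs = [], ([], [])              # two staging buffers (double buffering)
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = None
        for i, bucket in enumerate(buckets):
            # Copy bucket i into one buffer while the previous bucket is
            # still being broadcast out of the other buffer.
            staged = pool.submit(h2d_copy, bucket, bufs[i % 2]).result()
            if pending is not None:
                pending.result()          # serialize broadcasts to keep order
            pending = pool.submit(broadcast_and_reload, staged, out)
        if pending is not None:
            pending.result()              # drain the last in-flight broadcast
    return out

assert pipelined_update([[1, 2], [3, 4], [5, 6]]) == [1, 2, 3, 4, 5, 6]
```

The double buffer is what makes the overlap safe: while bucket i is broadcast out of one buffer, bucket i+1 is staged into the other, so no stage ever reads a buffer that another stage is writing.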

Quick Start & Requirements

  • Install: pip install checkpoint-engine or pip install 'checkpoint-engine[p2p]'.
  • Prerequisites: vLLM v0.10.2rc1 (pinned to a specific API commit), Python 3.12, and H800/H20-class GPUs for the tested configurations; FP8 requires vLLM patches.
  • Setup: clone vLLM, prepare the environment, install dependencies, download model weights, and launch vLLM with VllmColocateWorkerExtension.
  • Docs/Demo: the README provides a setup guide and demo commands.

Highlighted Details

  • Updates a 1T-parameter model (Kimi-K2) in ~20 s across thousands of GPUs.
  • Benchmarks show efficient updates, e.g., 1.42 GiB in 3.94 s (Broadcast) on 8×H800 for GLM-4.5-Air.
  • Supports dynamic joining of new inference instances, reusing weights.
  • Implements pipelined data transfer, overlapping communication and computation.
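A quick sanity check of the GLM-4.5-Air figure above: 1.42 GiB moved in 3.94 s works out to roughly 0.36 GiB/s end to end. Note that this covers the whole copy-broadcast-reload path, not raw link bandwidth:

```python
# Figures from the Broadcast benchmark bullet above.
size_gib, seconds = 1.42, 3.94
rate = size_gib / seconds       # end-to-end update rate over all three stages
print(f"{rate:.2f} GiB/s")      # prints 0.36 GiB/s
```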

Maintenance & Community

The README provides no community links (Discord, Slack) or roadmap details, but it credits youkaichao for contributions to the vLLM integration.

Licensing & Compatibility

The license type and any compatibility restrictions are not specified in the provided README content.

Limitations & Caveats

  • Currently vLLM-specific; integration with other frameworks (e.g., SGLang) is planned.
  • The full three-stage pipeline is not yet implemented.
  • The P2P update path still has room for optimization.
  • FP8 support requires specific patches and may have compatibility issues beyond the tested models.
Health Check

  • Last commit: 1 day ago
  • Responsiveness: Inactive
  • Pull requests (30d): 8
  • Issues (30d): 7
  • Star history: 709 stars in the last 10 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC (0.5%, 4k stars)
Python framework for LLM inference and serving
Created 2 years ago · Updated 12 hours ago

Starred by Carol Willing (Core Contributor to CPython, Jupyter), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

dynamo by ai-dynamo (1.0%, 5k stars)
Inference framework for distributed generative AI model serving
Created 6 months ago · Updated 13 hours ago