reap by CerebrasResearch

SMoE LLM compression via novel expert pruning

Created 4 months ago
264 stars

Top 96.7% on SourcePulse

Summary

This repository implements Router-weighted Expert Activation Pruning (REAP), a method for compressing Sparsely-activated Mixture-of-Experts (SMoE) Large Language Models (LLMs). It addresses the memory overhead of SMoEs by pruning experts, outperforming existing merging and pruning methods, with the largest gains at 50% compression. REAP enables near-lossless compression for critical tasks like code generation and tool-calling, making it valuable for researchers and engineers working with large-scale SMoE models.

How It Works

REAP introduces a novel expert pruning criterion that evaluates an expert's contribution based on both router gate-values and average activation norms. This approach contrasts with expert merging, which the authors argue leads to irreducible error and functional subspace collapse by diminishing the router's independent modulation capabilities. By preserving the router's control over the remaining experts, REAP maintains a larger functional output space, resulting in superior compression performance.
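The criterion described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names, array shapes, and the exact form of the score (gate value times the L2 norm of the expert's output, averaged over the calibration tokens routed to that expert) are assumptions based on the summary's description.

```python
import numpy as np

def reap_scores(gate_vals, expert_outs):
    """Hypothetical REAP-style saliency sketch.

    gate_vals:   (tokens, experts) router gate values (0 where an
                 expert is not activated for a token).
    expert_outs: (tokens, experts, hidden) expert outputs.

    Each expert's score averages gate value * L2 norm of its output
    over the tokens actually routed to it.
    """
    n_experts = gate_vals.shape[1]
    scores = np.zeros(n_experts)
    for j in range(n_experts):
        routed = gate_vals[:, j] > 0  # tokens where expert j fired
        if routed.any():
            norms = np.linalg.norm(expert_outs[routed, j, :], axis=-1)
            scores[j] = np.mean(gate_vals[routed, j] * norms)
    return scores

def prune_experts(scores, keep_ratio=0.5):
    """Keep the highest-scoring experts (keep_ratio=0.5 = 50% pruned)."""
    k = max(1, int(len(scores) * keep_ratio))
    return np.sort(np.argsort(scores)[::-1][:k])
```

Because the router's gate values over the surviving experts are left untouched, the router retains its independent per-token modulation, which is the property the authors argue merging destroys.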

Quick Start & Requirements

Installation can be done via a virtual environment using uv and scripts/build.sh, or through Docker with docker compose up --build -d. Configuration involves copying and populating .env.template and potentially specific WildBench configuration files. Adding new models requires updating src/reap/model_util.py with model-specific attribute names for SMoE components. Experiment execution scripts (merging-cli.sh, pruning-cli.sh) accept arguments for CUDA devices, model names, pruning/merging methods, compression ratios, and evaluation flags.
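A minimal setup session, using only the commands and files named above, might look like the following; the destination filename for the populated `.env` is an assumption.

```shell
# Option 1: virtual environment via uv and the provided build script
./scripts/build.sh

# Option 2: Docker
docker compose up --build -d

# Populate configuration (destination filename is an assumption)
cp .env.template .env
```

The experiment scripts (`merging-cli.sh`, `pruning-cli.sh`) then take CUDA devices, model names, method, compression ratio, and evaluation flags as arguments; consult the scripts themselves for the exact flag names.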

Highlighted Details

  • REAP consistently outperforms expert merging and other pruning methods across diverse SMoE architectures (20B to 1T parameters) on generative benchmarks.
  • Achieves near-lossless compression on code generation and tool-calling tasks for models like Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
  • Offers pre-trained HuggingFace checkpoints for various compressed SMoE models, including GLM4.6, Qwen3-Coder-480B, and Kimi-Linear.
  • The research provides theoretical backing for why pruning prevails over merging for one-shot SMoE compression.

Maintenance & Community

No specific details regarding maintenance, community channels (e.g., Discord, Slack), or notable contributors were found in the provided README.

Licensing & Compatibility

The README does not specify the software license or provide compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

No explicit limitations, known bugs, or alpha status were mentioned in the provided README text.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2
  • Star History: 49 stars in the last 30 days
