llama-moe by pjlab-sys4nlp

MoE model from LLaMA with continual pre-training

created 2 years ago
977 stars

Top 38.6% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

LLaMA-MoE provides a series of Mixture-of-Experts (MoE) language models derived from LLaMA, offering a smaller, more accessible alternative to larger MoE architectures. It targets researchers and developers seeking efficient MoE models for deployment and experimentation, enabling MoE capabilities with significantly reduced active parameter counts.

How It Works

LLaMA-MoE transforms LLaMA models by partitioning their Feed-Forward Networks (FFNs) into multiple "experts" and adding a top-K gating mechanism that routes each token to a subset of those experts. The converted model is then continually pre-trained on a blend of Sheared LLaMA data and filtered SlimPajama data, with dynamic data-sampling weights across domains. This yields MoE models with only 3.0-3.5B active parameters, improving efficiency while retaining performance.
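As a minimal sketch of that idea (not the project's actual implementation), the example below splits an FFN's intermediate neurons across experts and routes each token through a top-K softmax gate; all dimensions, class names, and the dense routing loop are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Illustrative MoE FFN: experts own disjoint slices of the FFN's hidden neurons."""
    def __init__(self, d_model=4096, d_ff=11008, num_experts=16, k=4):
        super().__init__()
        assert d_ff % num_experts == 0
        d_expert = d_ff // num_experts  # each expert gets a slice of FFN neurons
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x)                          # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dense loop for clarity, not speed
            idx, w = topk_idx[:, slot], weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out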

Quick Start & Requirements

  • Install: pip install -e .[dev] (after cloning the repo and setting up environment variables for CUDA, GCC, and PyTorch).
  • Prerequisites: Python >= 3.10, PyTorch with CUDA 11.8, flash-attn==2.0.1, requirements.txt dependencies.
  • Setup: Requires manual environment setup for CUDA and GCC, potentially including flash-attn compilation.
  • Demo: the Quick Start section of the README provides a runnable inference example (a hedged usage sketch follows this list).
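A hedged usage sketch, assuming the released checkpoints load through Hugging Face transformers with trust_remote_code enabled; the model ID below is illustrative, so check the README's Quick Start for the exact checkpoint names.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"  # assumed ID; verify against the README

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # the MoE layers ship as custom modeling code
).cuda().eval()

inputs = tokenizer("Suzhou is famous for", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))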

Highlighted Details

  • Lightweight MoE models with 3.0-3.5B active parameters.
  • Supports multiple expert construction methods (Neuron-Independent, Neuron-Sharing) and gating strategies (TopK Noisy Gate, Switch Gating); the Neuron-Independent idea is sketched after this list.
  • Integrates FlashAttention-v2 for faster continual pre-training.
  • Offers extensive monitoring for training metrics like gate load, loss, and utilization.
  • Benchmarks show competitive performance against other 2.7-3B models on various NLP tasks.
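To make the Neuron-Independent construction above concrete, here is an illustrative sketch (not the repository's actual API) in which a pretrained FFN's intermediate neurons are randomly partitioned into disjoint groups and the matching weight slices become the experts; the function name and tensor layout are assumptions.

import torch

def split_ffn_into_experts(w_up, w_down, num_experts, seed=0):
    """w_up: (d_ff, d_model) up-projection; w_down: (d_model, d_ff) down-projection."""
    d_ff = w_up.shape[0]
    assert d_ff % num_experts == 0
    gen = torch.Generator().manual_seed(seed)
    groups = torch.randperm(d_ff, generator=gen).chunk(num_experts)  # disjoint neuron sets
    return [
        {
            "w_up": w_up[idx, :].clone(),      # rows of the up-projection for this expert
            "w_down": w_down[:, idx].clone(),  # matching columns of the down-projection
        }
        for idx in groups
    ]

# Example: a LLaMA-7B-sized FFN (11008 neurons) split into 16 experts of 688 neurons each.
experts = split_ffn_into_experts(torch.randn(11008, 4096), torch.randn(4096, 11008), num_experts=16)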

Maintenance & Community

The project is associated with the pjlab-sys4nlp organization. Further community or roadmap details are not explicitly provided in the README.

Licensing & Compatibility

Specific license terms are not stated in the README. Model weights and benchmark results are publicly released, which suggests the project is intended for open research use, but users should confirm the repository's license file and any upstream LLaMA licensing terms before commercial use.

Limitations & Caveats

Installation requires careful environment configuration, including specific CUDA and GCC versions and possibly compiling flash-attn. Expert construction, continual pre-training, evaluation, and SFT are each covered by separate documentation, indicating that these workflows involve non-trivial setup.

Health Check
Last commit

7 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
23 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI

Library for aligning LLMs using human-aware loss functions

created 1 year ago
updated 2 weeks ago
873 stars

Top 0.2% on sourcepulse