MoE model from LLaMA with continual pre-training
LLaMA-MoE provides a series of Mixture-of-Experts (MoE) language models derived from LLaMA, offering a smaller, more accessible alternative to larger MoE architectures. It targets researchers and developers seeking efficient MoE models for deployment and experimentation, enabling MoE capabilities with significantly reduced active parameter counts.
How It Works
LLaMA-MoE transforms LLaMA models by partitioning their Feed-Forward Networks (FFNs) into multiple "experts" and adding a top-K gating network that routes each token to a subset of them. The converted model is then continually pre-trained on a blend of Sheared LLaMA data and filtered SlimPajama, with dynamically adjusted domain sampling weights. The result is MoE models with only 3.0B-3.5B activated parameters, improving efficiency while retaining performance.
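A minimal PyTorch sketch of this conversion idea is shown below. It is illustrative only: the class, default sizes, and routing loop are simplified assumptions rather than the repository's actual modules (the real LLaMA FFN is a gated SwiGLU block, and production implementations use vectorized dispatch), but it shows how a dense FFN's intermediate dimension can be split across experts behind a top-K router.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEFFN(nn.Module):
    """Toy MoE layer: the intermediate neurons of one dense FFN are split
    evenly across `num_experts` smaller FFNs, and a learned top-K router
    decides which experts process each token.
    Illustrative sketch only -- not the LLaMA-MoE source code."""

    def __init__(self, hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2):
        super().__init__()
        expert_size = intermediate_size // num_experts  # each expert owns a slice of the FFN
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_size),
                nn.SiLU(),
                nn.Linear(expert_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(16, 4096)                   # a batch of 16 token embeddings
print(TopKMoEFFN()(tokens).shape)                # torch.Size([16, 4096])

Because only top_k of the num_experts slices run per token, the activated parameter count per forward pass is a fraction of the original dense FFN, which is the source of the 3.0B-3.5B figures above.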
Quick Start & Requirements
pip install -e .[dev]
Run the command above after cloning the repo and setting up environment variables for CUDA, GCC, and PyTorch. Dependencies are pinned in requirements.txt, including flash-attn==2.0.1, which may require compilation against the local CUDA toolkit.
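For quick inference, the released checkpoints can be loaded through Hugging Face transformers. The snippet below is a sketch; the model ID is assumed from the project's published naming and should be replaced with the variant you actually download.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; substitute the LLaMA-MoE variant you use.
model_id = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # custom MoE modeling code ships alongside the checkpoint
)
model.eval()

prompt = "Suzhou is famous for"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))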
Highlighted Details
The headline features are the small activated-parameter footprint (3.0B-3.5B) of the converted MoE models and the public release of their weights.
Maintenance & Community
The project is associated with the pjlab-sys4nlp organization. Further community or roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Model weights are publicly released, but because the models are derived from LLaMA, the upstream LLaMA license terms may also apply; verify licensing in the repository before research or commercial use.
Limitations & Caveats
Installation requires careful environment configuration, including specific CUDA and GCC versions and possibly compiling flash-attn from source. The README points to separate documentation for expert construction, continual pre-training, evaluation, and SFT, suggesting these are multi-step workflows.
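Because flash-attn builds against the local toolchain, a quick check of the CUDA and GCC versions before installing can save a failed compile. The helper below is a hypothetical convenience script, not part of the repository.

import shutil
import subprocess

import torch


def report_build_environment():
    """Print the versions a flash-attn build typically depends on.
    Hypothetical helper for sanity-checking; not part of LLaMA-MoE."""
    print(f"PyTorch              : {torch.__version__}")
    print(f"CUDA used by PyTorch : {torch.version.cuda}")
    for tool in ("nvcc", "gcc"):
        path = shutil.which(tool)
        if path is None:
            print(f"{tool:<21}: not found on PATH")
            continue
        out = subprocess.run([path, "--version"], capture_output=True, text=True).stdout
        # The first line mentioning a release/version number is usually enough.
        line = next((l for l in out.splitlines() if "release" in l or "gcc" in l.lower()), out.strip())
        print(f"{tool:<21}: {line.strip()}")


if __name__ == "__main__":
    report_build_environment()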