MoE model from LLaMA with continual pre-training
LLaMA-MoE provides a series of Mixture-of-Experts (MoE) language models derived from LLaMA, offering a smaller, more accessible alternative to larger MoE architectures. It targets researchers and developers seeking efficient MoE models for deployment and experimentation, enabling MoE capabilities with significantly reduced active parameter counts.
How It Works
LLaMA-MoE transforms LLaMA models by partitioning their Feed-Forward Networks (FFNs) into multiple "experts" and adding a top-K gating network that routes each token to a subset of them. The converted model is then continually pre-trained on a blend of Sheared LLaMA data and filtered SlimPajama, with dynamically adjusted domain sampling weights. The result is MoE models with only 3.0B-3.5B activated parameters, improving efficiency while retaining performance.
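A minimal PyTorch sketch of this conversion idea is shown below. It is illustrative only: the class, default sizes, and routing loop are simplified assumptions rather than the repository's actual modules (the real LLaMA FFN is a gated SwiGLU block, and production implementations use vectorized dispatch), but it shows how a dense FFN's intermediate dimension can be split across experts behind a top-K router.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEFFN(nn.Module):
    """Toy MoE layer: the intermediate neurons of one dense FFN are split
    evenly across `num_experts` smaller FFNs, and a learned top-K router
    decides which experts process each token.
    Illustrative sketch only -- not the LLaMA-MoE source code."""

    def __init__(self, hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2):
        super().__init__()
        expert_size = intermediate_size // num_experts  # each expert owns a slice of the FFN
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, expert_size),
                nn.SiLU(),
                nn.Linear(expert_size, hidden_size),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        scores = self.gate(x)                    # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(16, 4096)                   # a batch of 16 token embeddings
print(TopKMoEFFN()(tokens).shape)                # torch.Size([16, 4096])

Because only top_k of the num_experts slices run per token, the activated parameter count per forward pass is a fraction of the original dense FFN, which is the source of the 3.0B-3.5B figures above.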
Quick Start & Requirements
pip install -e .[dev]
Run the command above after cloning the repo and setting up environment variables for CUDA, GCC, and PyTorch. Dependencies are pinned in requirements.txt, including flash-attn==2.0.1, which may require compilation against the local CUDA toolkit.
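For quick inference, the released checkpoints can be loaded through Hugging Face transformers. The snippet below is a sketch; the model ID is assumed from the project's published naming and should be replaced with the variant you actually download.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; substitute the LLaMA-MoE variant you use.
model_id = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # custom MoE modeling code ships alongside the checkpoint
)
model.eval()

prompt = "Suzhou is famous for"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))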
Highlighted Details
The headline features are the small activated-parameter footprint (3.0B-3.5B) of the converted MoE models and the public release of their weights.
Maintenance & Community
The project is associated with the pjlab-sys4nlp organization. Further community or roadmap details are not explicitly provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Model weights are publicly released, but because the models are derived from LLaMA, the upstream LLaMA license terms may also apply; verify licensing in the repository before research or commercial use.
Limitations & Caveats
Installation requires careful environment configuration, including specific CUDA and GCC versions and possibly compiling flash-attn from source. The README points to separate documentation for expert construction, continual pre-training, evaluation, and SFT, suggesting these are multi-step workflows.
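Because flash-attn builds against the local toolchain, a quick check of the CUDA and GCC versions before installing can save a failed compile. The helper below is a hypothetical convenience script, not part of the repository.

import shutil
import subprocess

import torch


def report_build_environment():
    """Print the versions a flash-attn build typically depends on.
    Hypothetical helper for sanity-checking; not part of LLaMA-MoE."""
    print(f"PyTorch              : {torch.__version__}")
    print(f"CUDA used by PyTorch : {torch.version.cuda}")
    for tool in ("nvcc", "gcc"):
        path = shutil.which(tool)
        if path is None:
            print(f"{tool:<21}: not found on PATH")
            continue
        out = subprocess.run([path, "--version"], capture_output=True, text=True).stdout
        # The first line mentioning a release/version number is usually enough.
        line = next((l for l in out.splitlines() if "release" in l or "gcc" in l.lower()), out.strip())
        print(f"{tool:<21}: {line.strip()}")


if __name__ == "__main__":
    report_build_environment()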