Vision-language model research paper using Mixture-of-Experts
MoE-LLaVA introduces a Mixture-of-Experts (MoE) approach to enhance Large Vision-Language Models (LVLMs). It targets researchers and developers seeking efficient and high-performing multimodal models, offering improved capabilities with sparser parameter activation.
How It Works
MoE-LLaVA integrates a sparse MoE layer into existing vision-language architectures. This allows for selective activation of expert networks based on input, leading to more efficient computation and potentially better performance. The project leverages a simple MoE tuning stage, enabling rapid training on modest hardware.
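For intuition, a sparse MoE layer replaces a dense feed-forward block with several expert networks plus a router that activates only the top-k experts for each token. The sketch below is a minimal, self-contained PyTorch illustration of that routing pattern, not MoE-LLaVA's actual implementation; the class name, dimensions, expert count, and top-2 routing are placeholder assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    # Toy sparse MoE feed-forward layer: a linear router scores experts per token
    # and only the top-k experts are evaluated for that token.
    def __init__(self, dim=512, hidden=2048, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, dim) of fused image+text tokens
        scores = self.router(x)                             # (batch, seq_len, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[..., k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(2, 16, 512)        # dummy token sequence
print(SparseMoEFFN()(tokens).shape)     # torch.Size([2, 16, 512])

Because only the selected experts run for each token, total parameter count can grow with the number of experts while per-token compute stays roughly constant.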
Quick Start & Requirements
Install with pip install -e . and pip install -e ".[train]". flash-attn is also recommended.
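After installation, a quick way to confirm the optional flash-attn dependency is visible to Python (a hedged sketch; the package imports as flash_attn, and the version attribute is read defensively in case it is absent):

try:
    import flash_attn  # optional dependency for faster attention kernels
    print("flash-attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash-attn not installed; attention will fall back to the default implementation")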
Highlighted Details
Maintenance & Community
The project is actively maintained by the PKU-YuanGroup. Related projects include Video-LLaVA and LanguageBind.
Licensing & Compatibility
Released under the Apache 2.0 license. However, usage is also subject to the LLaMA model license, OpenAI's Terms of Use for generated data, and ShareGPT's privacy practices, which may restrict commercial use and impose data-privacy obligations.
Limitations & Caveats
The project notes that FlashAttention-2 may cause performance degradation. The license terms of the underlying models and data sources may impose significant restrictions on deployment and commercial use.