MoE-LLaVA by PKU-YuanGroup

Vision-language model research paper using Mixture-of-Experts

created 1 year ago
2,199 stars

Top 21.0% on sourcepulse

View on GitHub
Project Summary

MoE-LLaVA introduces a Mixture-of-Experts (MoE) approach to enhance Large Vision-Language Models (LVLMs). It targets researchers and developers seeking efficient, high-performing multimodal models: by activating only a sparse subset of parameters per token, it aims to match dense-model performance at lower compute cost.

How It Works

MoE-LLaVA integrates sparse MoE layers into an existing vision-language architecture. A learned router sends each token to only its top-k expert networks, so just a fraction of the model's parameters is active in any forward pass, reducing computation while preserving, and potentially improving, quality. The project uses a simple MoE-Tuning training stage, enabling rapid training on modest hardware.
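
For intuition, the sketch below shows top-k expert routing in PyTorch. It is a minimal illustration, not the repository's implementation; the layer sizes, expert count, and top-k value are assumed for the example.

    # Minimal sketch of a sparse top-k MoE feed-forward layer (illustrative only;
    # sizes and expert count are assumptions, not MoE-LLaVA's actual configuration).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoE(nn.Module):
        def __init__(self, dim=512, hidden=2048, num_experts=4, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(dim, num_experts)    # learned gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                            # x: (num_tokens, dim)
            logits = self.router(x)                      # (num_tokens, num_experts)
            gates, idx = logits.topk(self.top_k, dim=-1) # each token picks its top-k experts
            gates = F.softmax(gates, dim=-1)             # renormalize over selected experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                # tokens whose k-th choice is expert e
                    if mask.any():
                        out[mask] += gates[mask, k, None] * expert(x[mask])
            return out

    tokens = torch.randn(8, 512)
    print(SparseMoE()(tokens).shape)                     # torch.Size([8, 512])

Only the selected experts run on each token, which is the source of the efficiency claim: total parameter count grows with the number of experts, but per-token compute stays roughly constant.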

Quick Start & Requirements

  • Install: Clone the repository, activate a Python 3.10 environment, and install dependencies with pip install -e . and pip install -e ".[train]". Installing flash-attn is also recommended; a consolidated sequence follows this list.
  • Prerequisites: Python 3.10, PyTorch 2.0.1, CUDA >= 11.7, Transformers 4.37.0, Tokenizers 0.15.1.
  • Resources: Training can be completed on 8 A100 GPUs within a day.
  • Demos: Hugging Face Spaces demo and Colab notebook are available.
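
A minimal install sequence, assuming conda; the environment name is illustrative, and the repository URL follows the project's GitHub naming:

    conda create -n moellava python=3.10 -y       # "moellava" is an illustrative name
    conda activate moellava
    git clone https://github.com/PKU-YuanGroup/MoE-LLaVA
    cd MoE-LLaVA
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation   # recommended; see Limitations & Caveats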

Highlighted Details

  • Achieves performance comparable to LLaVA-1.5-7B with only 3B sparsely activated parameters.
  • Outperforms LLaVA-1.5-13B on the POPE object hallucination benchmark.
  • Offers models based on Phi2, Qwen, and StableLM backbones.
  • Supports Gradio Web UI and CLI inference; a sketch of typical launch commands follows this list.
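
The commands below are an assumption-laden sketch in the style of LLaVA-family repositories; the script paths, flags, and model identifier are not confirmed here, so verify them against the repository README:

    # Hypothetical paths and flags -- verify against the repo README.
    # Gradio Web UI
    deepspeed moellava/serve/gradio_web_server.py \
        --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e"
    # CLI inference on a single image
    deepspeed --include localhost:0 moellava/serve/cli.py \
        --model-path "LanguageBind/MoE-LLaVA-Phi2-2.7B-4e" \
        --image-file "path/to/image.jpg"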

Maintenance & Community

The project is actively maintained by the PKU-YuanGroup. Related projects include Video-LLaVA and LanguageBind.

Licensing & Compatibility

Released under the Apache 2.0 license. However, usage is also subject to the LLaMA model license, OpenAI's Terms of Use (for the generated training data), and ShareGPT's privacy practices, which may restrict commercial use and impose data-handling obligations.

Limitations & Caveats

The project notes that FlashAttention-2 may cause performance degradation. The license terms of the underlying models and data sources may impose significant restrictions on deployment and commercial use.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History
55 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities

Top 0.2% · 23k stars
created 2 years ago · updated 11 months ago