Multimodal assistant with small language models
LLaVA-Phi and Mipha are open-source projects focused on building efficient multimodal assistants around small language models (SLMs). Both target strong performance on vision-language benchmarks while remaining practical for researchers and developers working in resource-constrained environments.
How It Works
The projects employ a two-stage training process. First, a feature alignment stage connects a frozen vision encoder (SigLIP-SO) to a frozen SLM (Phi-2 or Phi-1.5) through a lightweight projection module. Second, a visual instruction tuning stage uses a combination of GPT-generated multimodal instructions and academic VQA datasets to teach the model to follow multimodal commands. This approach reuses existing SLMs and vision encoders to obtain multimodal capabilities at low cost.
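As in the LLaVA recipe this line of work builds on, the alignment stage amounts to training a small projector that maps frozen vision features into the SLM's embedding space. The PyTorch snippet below is a minimal sketch of that idea; the hidden dimensions (1152 for SigLIP-SO features, 2560 for Phi-2) and the two-layer MLP shape are illustrative assumptions, not the projects' exact configuration.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Sketch of a feature-alignment projector: maps patch features from a
    frozen vision encoder into the (frozen) SLM's embedding space.
    Dimensions below are assumptions for illustration only."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(image_features)


# Stage 1: only the projector is trainable; encoder and SLM stay frozen.
projector = VisionProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```

The projected tokens are then concatenated with the text embeddings before being fed to the language model; stage 2 (instruction tuning) typically unfreezes the SLM as well.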
Quick Start & Requirements
Install from source within a conda environment (Python 3.10 recommended):
pip install -e .
Highlighted Details
Maintenance & Community
The project is actively developed, with recent releases of Mipha-3B and the associated training code. It builds upon LLaVA, LLaMA-Factory, and Safe-RLHF.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not specify a license, which may hinder commercial adoption. Setup instructions for integrating the base models and datasets are provided, but they involve several manual steps and require careful execution.