llava-phi  by xmoanvaf

Multimodal assistant with small language models

created 1 year ago
392 stars

Top 74.5% on sourcepulse

Project Summary

LLaVA-Phi and Mipha are open-source projects for building efficient multimodal assistants on top of small language models (SLMs). They target strong performance across vision-language tasks at a small parameter budget, which makes them a fit for researchers and developers working in resource-constrained environments.

How It Works

The projects use a two-stage training process. First, a feature alignment stage connects a frozen vision encoder (SigLIP-SO) to a frozen SLM (Phi-2 or Phi-1.5). Second, a visual instruction tuning stage trains on a mix of GPT-generated multimodal instructions and academic VQA datasets so that the model learns to follow multimodal instructions. Because both stages start from pretrained SLMs and vision encoders, neither component has to be trained from scratch.
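The two stages can be sketched as a plan of which components receive gradient updates. This is a minimal illustration, not the repository's code: the component labels, the `StagePlan` class, and the `tune_vision_encoder` switch are hypothetical names, and treating the connector ("projector") as the only trainable part in stage one is an assumption based on the summary's statement that the encoder and the SLM are both frozen there.

```python
from dataclasses import dataclass

# Illustrative labels only; these are not identifiers from the repository.
COMPONENTS = ("vision_encoder", "projector", "language_model")


@dataclass(frozen=True)
class StagePlan:
    """Which components are updated during one training stage."""
    name: str
    trainable: frozenset

    def frozen(self) -> frozenset:
        # Everything that is not trainable stays frozen.
        return frozenset(COMPONENTS) - self.trainable


def plan_for_stage(stage: int, tune_vision_encoder: bool = False) -> StagePlan:
    """Return a hypothetical training plan for stage 1 or stage 2."""
    if stage == 1:
        # Feature alignment: encoder and SLM are frozen (per the summary).
        # Assumption: only the connector between them is trained.
        return StagePlan("feature_alignment", frozenset({"projector"}))
    if stage == 2:
        # Visual instruction tuning. Assumption: connector and SLM are updated;
        # whether the vision encoder is also tuned is left as a switch.
        trainable = {"projector", "language_model"}
        if tune_vision_encoder:
            trainable.add("vision_encoder")
        return StagePlan("visual_instruction_tuning", frozenset(trainable))
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    for stage in (1, 2):
        plan = plan_for_stage(stage)
        print(plan.name, "trains", sorted(plan.trainable),
              "freezes", sorted(plan.frozen()))
```

In a real training loop, a plan like this would typically be applied by toggling `requires_grad` on each component's parameters before building the optimizer.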

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -e . within a conda environment (Python 3.10 recommended).
  • Prerequisites: Download the base model weights for Phi-2 (or Phi-1.5) and SigLIP-SO, along with the LAION-CC-SBU dataset subset.
  • Resources: Training covers both the feature alignment and visual instruction tuning stages, and hyperparameters are provided for each. CLI inference is supported.
  • Links: Mipha GitHub
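Before starting a training run, it can help to confirm that the prerequisites listed above are in place. The following is a small preflight sketch; the directory layout (`checkpoints/...`, `data/...`) and both function names are hypothetical and should be adjusted to wherever the weights and dataset were actually downloaded.

```python
import sys
from pathlib import Path


def missing_prerequisites(required: dict[str, Path]) -> list[str]:
    """Return the labels whose directory is absent or empty."""
    missing = []
    for label, path in required.items():
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(label)
    return missing


def python_matches(recommended: tuple[int, int] = (3, 10)) -> bool:
    """Check the running interpreter against the recommended version."""
    return sys.version_info[:2] == recommended


if __name__ == "__main__":
    # Hypothetical locations; the repository may expect different paths.
    required = {
        "Phi-2 or Phi-1.5 weights": Path("checkpoints/base/phi"),
        "SigLIP-SO weights": Path("checkpoints/base/siglip-so"),
        "LAION-CC-SBU subset": Path("data/laion-cc-sbu"),
    }
    if not python_matches():
        print("Note: Python 3.10 is recommended.")
    for label in missing_prerequisites(required):
        print(f"Missing or empty: {label}")
```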

Highlighted Details

  • Mipha-3B scores 81.3 on VQAv2, 63.9 on GQA, and 70.9 on SQA-I (the ScienceQA image subset), outperforming smaller models.
  • Mipha-1.6B and Mipha-2.4B offer competitive performance at even smaller parameter counts.
  • The papers were accepted at the ACM MM 2024 Workshop (LLaVA-Phi) and the AAAI 2025 Main Track (Mipha).
  • Training scripts are available, including DeepSpeed ZeRO-3 for finetuning.
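For readers unfamiliar with DeepSpeed ZeRO-3, the sketch below builds a typical stage-3 configuration as a Python dictionary and writes it to JSON. It is not the repository's own config file: the `zero3_config` helper and the output filename are hypothetical, and the `"auto"` values assume the config is consumed through the Hugging Face Trainer integration, which fills them in from its own arguments.

```python
import json


def zero3_config(offload_to_cpu: bool = False) -> dict:
    """Build a generic DeepSpeed ZeRO stage-3 config (illustrative only)."""
    zero = {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather full 16-bit weights when saving so checkpoints are usable
        # without the DeepSpeed partitioning.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
    if offload_to_cpu:
        # Optional: trade speed for GPU memory by offloading to host RAM.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        # "auto" is resolved by the Hugging Face Trainer integration.
        "bf16": {"enabled": "auto"},
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    # Hypothetical output path.
    with open("zero3_example.json", "w") as fh:
        json.dump(zero3_config(), fh, indent=2)
```

ZeRO-3 shards parameters as well as optimizer state and gradients across GPUs, which is why it is a common choice when the full model does not fit comfortably on a single device during finetuning.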

Maintenance & Community

Recent releases include Mipha-3B and its training code. The project builds on LLaVA, LLaMA-Factory, and Safe-RLHF.

Licensing & Compatibility

The README does not state a license, so terms for commercial use or closed-source linking are unspecified.

Limitations & Caveats

The missing license creates legal uncertainty for commercial adoption. Setup instructions for the base models and datasets are detailed, but they involve several manual download and configuration steps that need to be followed carefully.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 14 stars in the last 90 days
