Multimodal assistant with small language models
LLaVA-Phi and Mipha are open-source projects focused on building efficient multimodal assistants around small language models (SLMs). Both target strong performance on vision-language benchmarks while remaining practical for researchers and developers working in resource-constrained environments.
How It Works
The projects employ a two-stage training process. First, a feature alignment stage connects a frozen vision encoder (SigLIP-SO) to a frozen SLM (Phi-2 or Phi-1.5) through a lightweight projection module. Second, a visual instruction tuning stage uses a combination of GPT-generated multimodal instructions and academic VQA datasets to teach the model to follow multimodal commands. This approach reuses existing SLMs and vision encoders to obtain multimodal capabilities at low cost.
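As in the LLaVA recipe this line of work builds on, the alignment stage amounts to training a small projector that maps frozen vision features into the SLM's embedding space. The PyTorch snippet below is a minimal sketch of that idea; the hidden dimensions (1152 for SigLIP-SO features, 2560 for Phi-2) and the two-layer MLP shape are illustrative assumptions, not the projects' exact configuration.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Sketch of a feature-alignment projector: maps patch features from a
    frozen vision encoder into the (frozen) SLM's embedding space.
    Dimensions below are assumptions for illustration only."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(image_features)


# Stage 1: only the projector is trainable; encoder and SLM stay frozen.
projector = VisionProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)
```

The projected tokens are then concatenated with the text embeddings before being fed to the language model; stage 2 (instruction tuning) typically unfreezes the SLM as well.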
Quick Start & Requirements
Install from source within a conda environment (Python 3.10 recommended):
pip install -e .
Highlighted Details
Maintenance & Community
The project is actively developed, with recent releases of Mipha-3B and the associated training code. It builds upon LLaVA, LLaMA-Factory, and Safe-RLHF.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README does not specify a license, which may hinder commercial adoption. Setup instructions for integrating the base models and datasets are provided, but they involve several manual steps and require careful execution.