llava-phi by xmoanvaf

Multimodal assistant with small language models

Created 1 year ago
398 stars

Top 72.5% on SourcePulse

View on GitHub
Project Summary

LLaVA-Phi and Mipha are open-source projects focused on building efficient multimodal assistants around small language models (SLMs). They aim to deliver strong performance on a range of vision-language tasks while remaining practical for researchers and developers working in resource-constrained environments.

How It Works

The projects employ a two-stage training process. First, a feature alignment stage connects a frozen vision encoder (SigLIP-SO) to a frozen SLM (Phi-2 or Phi-1.5). Second, a visual instruction tuning stage uses a combination of GPT-generated multimodal instructions and academic VQA datasets to enable the model to follow multimodal commands. This approach leverages existing SLMs and vision encoders for efficient multimodal capabilities.
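
For orientation, here is a minimal PyTorch sketch of how such a two-stage setup is typically wired: a trainable MLP projector bridges the frozen vision features into the SLM's embedding space in stage 1, and the language model joins training in stage 2. This is not the repository's code; the module names, hidden sizes (1152 for SigLIP-SO, 2560 for Phi-2), and which parameters unfreeze per stage are assumptions based on the common LLaVA-style recipe.

```python
# Hypothetical sketch of the two-stage recipe (not the repository's actual code).
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping vision features into the SLM embedding space."""
    def __init__(self, vision_dim=1152, lm_dim=2560):  # assumed dims: SigLIP-SO -> Phi-2
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):
        return self.net(x)

# Placeholder modules standing in for the real SigLIP-SO encoder and Phi-2 backbone.
vision_encoder = nn.Linear(768, 1152)
language_model = nn.Linear(2560, 2560)
projector = Projector()

# Stage 1 (feature alignment): freeze both backbones, train only the projector.
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in language_model.parameters():
    p.requires_grad = False
stage1_params = list(projector.parameters())

# Stage 2 (visual instruction tuning): unfreeze the language model as well.
for p in language_model.parameters():
    p.requires_grad = True
stage2_params = list(projector.parameters()) + list(language_model.parameters())

# Shape check with dummy patch features: visual tokens land in the LM's embedding space.
patches = torch.randn(1, 729, 768)
visual_tokens = projector(vision_encoder(patches))
print(visual_tokens.shape)  # torch.Size([1, 729, 2560])
```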

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -e . within a conda environment (Python 3.10 recommended).
  • Prerequisites: Requires downloading base model weights for Phi-2 (or Phi-1.5) and SigLIP-SO, along with the LAION-CC-SBU dataset subset (one way to fetch the model weights is sketched after this list).
  • Resources: Training involves feature alignment and visual instruction tuning, with hyperparameters provided for both stages. CLI inference is supported.
  • Links: Mipha GitHub
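
One way to fetch the required base weights is through the Hugging Face Hub with the transformers library, as sketched below. The model IDs (microsoft/phi-2, microsoft/phi-1_5, google/siglip-so400m-patch14-384) are the public checkpoints; the repository's own scripts may instead expect these as local directories, and the LAION-CC-SBU subset still has to be downloaded separately.

```python
# Illustrative download of the base checkpoints via Hugging Face transformers.
# The repository's training scripts may expect these saved to specific local paths.
from transformers import AutoModel, AutoModelForCausalLM, AutoProcessor, AutoTokenizer

# Phi-2 backbone (use "microsoft/phi-1_5" for the smaller variant).
lm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

# SigLIP-SO400M vision tower and its image processor.
vision = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

# Cache everything under a predictable directory so training configs can point at it.
lm.save_pretrained("./checkpoints/phi-2")
tok.save_pretrained("./checkpoints/phi-2")
vision.save_pretrained("./checkpoints/siglip-so400m-patch14-384")
processor.save_pretrained("./checkpoints/siglip-so400m-patch14-384")
```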

Highlighted Details

  • Mipha-3B scores 81.3 on VQAv2, 63.9 on GQA, and 70.9 on SQA-IMG, outperforming several larger multimodal models.
  • Mipha-1.6B and Mipha-2.4B offer competitive performance with even smaller parameter counts.
  • The papers were accepted at the ACMMM 2024 Workshop (LLaVA-Phi) and the AAAI 2025 Main Track (Mipha).
  • Training scripts are available, including DeepSpeed ZeRO-3 for finetuning; a generic hookup is sketched below.
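
For orientation only, the snippet below shows a generic way to point a Hugging Face Trainer run at a ZeRO-3 DeepSpeed configuration. The config contents and hyperparameters are placeholders, not the repository's released settings, and running it requires the deepspeed package.

```python
# Generic DeepSpeed ZeRO-3 hookup via Hugging Face TrainingArguments
# (placeholder values; the repository ships its own launch scripts and ZeRO-3 config).
from transformers import TrainingArguments

# A minimal ZeRO-3 config; "auto" fields are filled in from TrainingArguments by the Trainer.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./checkpoints/finetune",
    per_device_train_batch_size=16,   # placeholder hyperparameters
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=1,
    deepspeed=zero3_config,           # a dict or a path to a JSON config both work
)
```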

Maintenance & Community

The project is actively developed, with the recent release of Mipha-3B and its associated training code. It builds upon LLaVA, LLaMA-Factory, and Safe-RLHF.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which may impact commercial adoption. Detailed setup instructions for integrating the base models and datasets are provided but require careful execution.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

  • Top 0.1% on SourcePulse · 4k stars
  • Any-to-any multimodal LLM research paper
  • Created 2 years ago, updated 5 months ago
  • Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Zack Li (Cofounder of Nexa AI), and 19 more.

LLaVA by haotian-liu

  • Top 0.2% on SourcePulse · 24k stars
  • Multimodal assistant with GPT-4 level capabilities
  • Created 2 years ago, updated 1 year ago