MIC by HaozheZhao

VLM for multimodal in-context learning research

created 2 years ago
352 stars

Top 80.3% on sourcepulse

Project Summary

MMICL is a state-of-the-art Vision-Language Model (VLM) designed to enhance reasoning and contextual learning capabilities in multimodal AI. It addresses the limitations of traditional VLMs by incorporating in-context learning abilities, enabling users to adapt the model to new tasks without fine-tuning. The project targets researchers and developers working on advanced multimodal AI applications.

How It Works

MMICL leverages a multimodal in-context learning (M-ICL) approach, fine-tuning VLMs on the manually constructed MIC dataset. This dataset supports interleaved text-image inputs, multiple image inputs, and multimodal in-context learning scenarios. The architecture integrates a Vision Transformer (ViT) as the visual encoder with pre-trained Large Language Models (LLMs) like FlanT5 or Vicuna, enabling sophisticated understanding of complex relationships between instructions and visual data.
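
To make the interleaved input format concrete, below is a minimal sketch of how a few-shot multimodal in-context prompt might be assembled before it is handed to the visual encoder and LLM. The `<image>` placeholder token, the prompt wording, and the dummy images are illustrative assumptions; the actual MIC templates define their own layout.

```python
from PIL import Image

# Dummy RGB images stand in for the two in-context example images and the query image.
img1, img2, img3 = (Image.new("RGB", (224, 224)) for _ in range(3))

# Two in-context examples followed by a query, interleaving text with an
# assumed "<image>" placeholder per image (the real MIC templates differ in detail).
segments = [
    (img1, "Image 1 is <image>. Question: What is the animal doing? Answer: sleeping."),
    (img2, "Image 2 is <image>. Question: What is the animal doing? Answer: running."),
    (img3, "Image 3 is <image>. Question: What is the animal doing? Answer:"),
]

# The text segments are joined into one prompt, while the images are kept in a
# parallel list ordered to match their "<image>" placeholders -- the usual way
# interleaved text-image inputs are fed to a ViT encoder + LLM pipeline.
prompt = " ".join(text for _, text in segments)
images = [img for img, _ in segments]

print(prompt)
print(f"{len(images)} images aligned with {prompt.count('<image>')} placeholders")
```

The key point is that the model sees worked examples (image plus question plus answer) in the same prompt as the query, which is what lets it adapt to a new task without any fine-tuning.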

Quick Start & Requirements

  • Install: Create a conda environment using conda env create -f environment.yml.
  • Prerequisites: Ubuntu servers with 4 or 6 NVIDIA A40 (46 GB) GPUs, CUDA 11.3, Apex, and DeepSpeed.
  • Data: The MIC dataset (jsonl files and images) is available on the Hugging Face Hub and ModelScope; see the download sketch after this list.
  • Demo: An online demo ("Demo for MMICL") is linked from the README.
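
As a rough sketch of the data step above, the snippet below pulls the MIC jsonl annotation files from the Hugging Face Hub and inspects one record. The dataset repo id, the file name, and the record fields are assumptions rather than details taken from the README; check the dataset card for the actual layout.

```python
import json
from huggingface_hub import snapshot_download

# Download only the jsonl annotation files; "BleachNick/MIC_full" is an assumed
# repo id -- substitute the id listed on the project's Hugging Face page.
local_dir = snapshot_download(
    repo_id="BleachNick/MIC_full",
    repo_type="dataset",
    allow_patterns=["*.jsonl"],
)

# Peek at the first record of one split; the file name below is an assumption.
with open(f"{local_dir}/vqa_train.jsonl") as f:
    record = json.loads(f.readline())
print(sorted(record.keys()))
```

Since the dataset ships images alongside the jsonl files, drop the allow_patterns filter to fetch everything before training or evaluation.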

Highlighted Details

  • Achieved 1st place on MME and MMBench leaderboards as of August 2023.
  • Supports analysis and reasoning across multiple images simultaneously.
  • Outperforms VL models of similar size on complex visual reasoning tasks.
  • Demonstrates novel capabilities in video understanding and M-ICL.

Maintenance & Community

The project is associated with PKU and has released models such as MMICL-FLANT5XXL and MMICL-Tiny, with further details on the model and dataset promised in future updates.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that the Vicuna-based versions and Chat Mode are still under development and may require careful parameter tuning. Reproducing the experiments may be difficult because training relies on a specific server configuration (NVIDIA A40 GPUs).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
