VLM for multimodal in-context learning research
Top 80.3% on sourcepulse
MMICL is a state-of-the-art Vision-Language Model (VLM) designed to improve reasoning and in-context learning in multimodal settings. It addresses a limitation of traditional VLMs by adding multimodal in-context learning, allowing users to adapt the model to new tasks without fine-tuning. The project targets researchers and developers working on advanced multimodal AI applications.
How It Works
MMICL uses a multimodal in-context learning (M-ICL) approach, fine-tuning VLMs on the manually constructed MIC dataset. The dataset covers interleaved text-image inputs, multiple images per example, and multimodal in-context learning scenarios. Architecturally, a Vision Transformer (ViT) visual encoder is paired with a pre-trained Large Language Model (LLM) such as FlanT5 or Vicuna, letting the model relate instructions to one or more images within a single prompt.
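To make the M-ICL input format concrete, the sketch below builds an interleaved few-shot prompt in which each image is replaced by a placeholder token whose visual features would later be spliced in by the encoder. The placeholder format ([IMG0], [IMG1], ...), the Example container, and build_interleaved_prompt are illustrative assumptions, not MMICL's actual tokens or API.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    image_paths: List[str]  # images referenced by this example
    text: str               # text with {img0}, {img1}, ... slots marking image positions

def build_interleaved_prompt(demos: List[Example], query: Example) -> Tuple[str, List[str]]:
    # Flatten few-shot demonstrations plus the query into one interleaved
    # sequence, replacing each image slot with a placeholder token so a
    # visual encoder can later inject image features at those positions.
    prompt_parts, all_images = [], []
    for ex in demos + [query]:
        slots = {}
        for i, path in enumerate(ex.image_paths):
            slots[f"img{i}"] = f"[IMG{len(all_images)}]"
            all_images.append(path)
        prompt_parts.append(ex.text.format(**slots))
    return "\n".join(prompt_parts), all_images

# Two in-context demonstrations followed by the query to be answered.
demos = [
    Example(["cat.jpg"], "Image {img0}: What animal is shown? Answer: a cat."),
    Example(["dog.jpg"], "Image {img0}: What animal is shown? Answer: a dog."),
]
query = Example(["bird.jpg"], "Image {img0}: What animal is shown? Answer:")
prompt, images = build_interleaved_prompt(demos, query)

In MMICL itself, interleaved sequences of this kind are consumed jointly by the ViT encoder and the LLM during fine-tuning on MIC.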
Quick Start & Requirements
conda env create -f environment.yml
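After the environment is created, inference with a released checkpoint would typically follow the usual Hugging Face pattern. The sketch below is an assumption-laden outline only: the checkpoint identifier is a placeholder, and MMICL's multi-image interleaving relies on the repository's custom modeling code, so the stock InstructBLIP classes are used here just to show the general shape of a single-image call.

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Placeholder checkpoint name; substitute the released MMICL weights you downloaded.
ckpt = "path/or/hub-id/of/MMICL-checkpoint"

processor = InstructBlipProcessor.from_pretrained(ckpt)
model = InstructBlipForConditionalGeneration.from_pretrained(
    ckpt, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text="What is shown in this image?",
                   return_tensors="pt").to("cuda", torch.float16)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])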
Highlighted Details
Maintenance & Community
The project is associated with PKU and has released models such as MMICL-FLANT5XXL and MMICL-Tiny. The README states that further details on the model and dataset will be released.
Licensing & Compatibility
The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README mentions that Vicuna versions and Chat Mode are under development and may require careful parameter adjustment. Reproducing experiments may be difficult due to reliance on specific server configurations (NVIDIA DGX-A40).