MIC by HaozheZhao

VLM for multimodal in-context learning research

Created 2 years ago · 354 stars · Top 78.8% on SourcePulse

View on GitHub
Project Summary

MMICL is a state-of-the-art Vision-Language Model (VLM) designed to enhance reasoning and contextual learning capabilities in multimodal AI. It addresses the limitations of traditional VLMs by incorporating in-context learning abilities, enabling users to adapt the model to new tasks without fine-tuning. The project targets researchers and developers working on advanced multimodal AI applications.

How It Works

MMICL leverages a multimodal in-context learning (M-ICL) approach, fine-tuning VLMs on the manually constructed MIC dataset. This dataset supports interleaved text-image inputs, multiple image inputs, and multimodal in-context learning scenarios. The architecture integrates a Vision Transformer (ViT) as the visual encoder with pre-trained Large Language Models (LLMs) like FlanT5 or Vicuna, enabling sophisticated understanding of complex relationships between instructions and visual data.
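To make the M-ICL input format concrete, here is a small illustrative sketch of how an interleaved image-text prompt can be assembled: a few (image, question, answer) exemplars followed by an unanswered query for the model to complete. The Segment helper, field names, and file paths are hypothetical and are not taken from the repository's code.

# Hypothetical sketch of an interleaved multimodal in-context prompt;
# "image" segments mark where ViT features would be spliced into the LLM input.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    kind: str   # "image" or "text"
    value: str  # image path for "image" segments, raw text for "text" segments

def build_icl_prompt(exemplars: List[Dict[str, str]], query: Dict[str, str]) -> List[Segment]:
    """Interleave (image, question, answer) exemplars, then append the unanswered query."""
    segments: List[Segment] = []
    for ex in exemplars:
        segments.append(Segment("image", ex["image"]))
        segments.append(Segment("text", f"Question: {ex['question']} Answer: {ex['answer']}"))
    segments.append(Segment("image", query["image"]))
    segments.append(Segment("text", f"Question: {query['question']} Answer:"))
    return segments

# Two in-context exemplars followed by a new query image (paths are placeholders).
prompt = build_icl_prompt(
    exemplars=[
        {"image": "demo/cat.jpg", "question": "What animal is this?", "answer": "A cat."},
        {"image": "demo/dog.jpg", "question": "What animal is this?", "answer": "A dog."},
    ],
    query={"image": "demo/fox.jpg", "question": "What animal is this?"},
)

Because the exemplars are supplied at inference time, swapping them out adapts the model to a new task without any fine-tuning, which is the in-context learning behavior described above.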

Quick Start & Requirements

  • Install: Create a conda environment using conda env create -f environment.yml.
  • Prerequisites: Ubuntu servers with 4 or 6 NVIDIA A40 (46 GB) GPUs, CUDA 11.3, NVIDIA Apex, and DeepSpeed.
  • Data: MIC dataset (jsonl files and images) available on Hugging Face Hub and ModelScope.
  • Demo: Available at Demo for MMICL (a minimal inference sketch follows this list).
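The sketch below assumes the released MMICL-FLANT5XXL checkpoint can be loaded through the InstructBLIP classes in Hugging Face transformers; the model and processor ids are assumptions, and the repository's own scripts remain the authoritative way to run the model (including its multi-image prompt format).

# Minimal loading/inference sketch. Assumptions: the checkpoint is compatible with
# transformers' InstructBLIP classes, and the ids below are placeholders -- check the
# project's Hugging Face page for the exact names.
import torch
from PIL import Image
from transformers import InstructBlipForConditionalGeneration, InstructBlipProcessor

MODEL_ID = "BleachNick/MMICL-Instructblip-T5-xxl"     # assumed checkpoint id
PROCESSOR_ID = "Salesforce/instructblip-flan-t5-xxl"  # assumed base processor

processor = InstructBlipProcessor.from_pretrained(PROCESSOR_ID)
model = InstructBlipForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # keep memory manageable; device_map needs accelerate
    device_map="auto",
)

image = Image.open("demo/example.jpg")  # placeholder image path
inputs = processor(
    images=image,
    text="Question: What is shown in the image? Answer:",
    return_tensors="pt",
).to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)  # match model dtype

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])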

Highlighted Details

  • Achieved 1st place on MME and MMBench leaderboards as of August 2023.
  • Supports analysis and reasoning across multiple images simultaneously.
  • Outperforms VL models of similar size on complex visual reasoning tasks.
  • Demonstrates novel capabilities in video understanding and M-ICL.

Maintenance & Community

The project is associated with PKU and has released models such as MMICL-FLANT5XXL and MMICL-Tiny. The README states that further details on the model and dataset are forthcoming.

Licensing & Compatibility

The project's licensing is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that the Vicuna versions and Chat Mode are still under development and may require careful parameter adjustment. Reproducing the experiments may be difficult because they rely on a specific multi-GPU server configuration (NVIDIA A40).

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; MTS at xAI), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

METER by zdou0830

373 stars · 0%
Multimodal framework for vision-and-language transformer research
Created 3 years ago · Updated 2 years ago
Starred by Jiaming Song (Chief Scientist at Luma AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

Otter by EvolvingLMMs-Lab

3k stars · 0.0%
Multimodal model for improved instruction following and in-context learning
Created 2 years ago · Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

4k stars · 0.1%
Any-to-any multimodal LLM research paper
Created 2 years ago · Updated 4 months ago