Multimodal-GPT by open-mmlab

Multimodal chatbot for visual/language instructions (research paper)

created 2 years ago
1,506 stars

Top 27.9% on sourcepulse

View on GitHub
Project Summary

This project provides a framework for training multimodal chatbots capable of understanding and responding to visual and language instructions. It targets researchers and developers looking to build advanced conversational AI systems with visual reasoning capabilities, leveraging the OpenFlamingo architecture.

How It Works

Multimodal-GPT builds upon the OpenFlamingo open-source multimodal model. It enhances performance by jointly training the model on a diverse set of visual instruction datasets (VQA, image captioning, visual reasoning, OCR, visual dialogue) and language-only instruction data. This approach allows the model to learn complementary visual and linguistic cues, leading to improved multimodal understanding and generation.
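As a rough illustration of that joint-training recipe (not the repository's actual data pipeline; the dataset classes, fields, and sampling weights below are made up for the example), a mixed loader might draw batches from both visual-instruction and language-only instruction data:

```python
# Conceptual sketch: mixing visual-instruction and language-only instruction
# samples into joint training batches. Dataset classes, fields, and weights
# are illustrative stand-ins, not the repository's actual pipeline.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler


class VisualInstructionSamples(Dataset):
    """Stand-in for VQA / captioning / OCR / visual-dialogue style data."""

    def __init__(self, n=128):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return {
            "image": torch.zeros(3, 224, 224),  # placeholder image tensor
            "text": f"<image> Question {i}? Answer: ...",
        }


class LanguageInstructionSamples(Dataset):
    """Stand-in for language-only instruction data (e.g. Dolly-style)."""

    def __init__(self, n=128):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return {
            "image": torch.zeros(3, 224, 224),  # dummy image so both sources share one schema
            "text": f"Instruction {i}: ... Response: ...",
        }


visual_ds = VisualInstructionSamples()
language_ds = LanguageInstructionSamples()
joint_ds = ConcatDataset([visual_ds, language_ds])

# Per-sample weights control how often each source is drawn.
weights = [1.0] * len(visual_ds) + [0.5] * len(language_ds)
sampler = WeightedRandomSampler(weights, num_samples=len(joint_ds))

loader = DataLoader(joint_ds, batch_size=8, sampler=sampler)
for batch in loader:
    # Forward/backward through the multimodal model (frozen backbone plus
    # tuned adapters) would go here.
    break
```

Adjusting the per-source sampling weights is one simple way to balance the complementary visual and linguistic signals described above.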

Quick Start & Requirements

  • Installation: Clone the repository, cd into it, and run pip install -r requirements.txt followed by pip install -v -e . (or use the provided environment.yml with Conda); these steps are consolidated in the sketch after this list.
  • Prerequisites: Requires pre-trained weights for LLaMA (converted to Hugging Face format) and OpenFlamingo-9B. LoRA weights are also provided.
  • Demo: Launch the Gradio demo with python app.py after setting up the checkpoint directories.
  • Resources: Fine-tuning requires significant compute; the reference launch command uses torchrun --nproc_per_node=8, i.e. eight GPUs per node. The instruction datasets must also be downloaded before fine-tuning.
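Put together, the bullet points above correspond to roughly the following shell session. This is a sketch: the repository URL is assumed to follow the project name, and the checkpoint layout and the exact fine-tuning entry point should be taken from the repository's README.

```bash
# Sketch of the setup steps above; verify paths and script names against the
# repository's README before running.
git clone https://github.com/open-mmlab/Multimodal-GPT.git
cd Multimodal-GPT

pip install -r requirements.txt
pip install -v -e .
# or, with Conda:
# conda env create -f environment.yml

# Download the converted LLaMA weights, OpenFlamingo-9B, and the provided LoRA
# weights into the checkpoint directories the README expects, then launch the
# local Gradio demo:
python app.py

# Fine-tuning runs on 8 GPUs per node; the training entry point and dataset
# preparation are described in the README:
# torchrun --nproc_per_node=8 <training script from the README> ...
```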

Highlighted Details

  • Supports parameter-efficient fine-tuning with LoRA (see the sketch after this list).
  • Enables simultaneous tuning of vision and language components.
  • Integrates various instruction datasets including A-OKVQA, COCO Caption, LLaVA, and Dolly.
  • Offers a local Gradio demo for interactive use.
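For the LoRA bullet, the following minimal sketch shows the general idea of a low-rank adapter on a frozen linear layer. It is purely illustrative and is not the repository's implementation; the class name, rank, and scaling are arbitrary.

```python
# Minimal LoRA-style linear layer, shown only to illustrate parameter-efficient
# fine-tuning; not the repository's implementation.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pre-trained projection
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen base projection plus a trainable low-rank update.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling


layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only the low-rank factors train
```

Because only the two low-rank factors receive gradients, this kind of fine-tuning needs far less GPU memory than updating the full model.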

Maintenance & Community

The project is part of the OpenMMLab ecosystem, known for its active development in computer vision. Acknowledgements include contributions from OpenFlamingo, LAVIS, and other prominent open-source projects.

Licensing & Compatibility

The README does not explicitly state a license for the repository. In addition, the project depends on LLaMA weights (originally distributed under Meta's non-commercial research license) and on OpenFlamingo, so the terms of those base models carry over. Commercial use would require careful review of all underlying component licenses.

Limitations & Caveats

Setting up the demo and fine-tuning requires downloading and organizing several large pre-trained checkpoints (LLaMA, OpenFlamingo-9B, LoRA weights) and instruction datasets, which is time-consuming and storage-heavy. Fine-tuning itself assumes a multi-GPU node (eight GPUs in the reference command).

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago
Starred by Travis Fischer (founder of Agentic), Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Top 0.2% · 23k stars
Multimodal assistant with GPT-4 level capabilities
created 2 years ago · updated 11 months ago