Multimodal chatbot for visual/language instructions (research paper)
Top 27.9% on sourcepulse
This project provides a framework for training multimodal chatbots capable of understanding and responding to visual and language instructions. It targets researchers and developers looking to build advanced conversational AI systems with visual reasoning capabilities, leveraging the OpenFlamingo architecture.
How It Works
Multimodal-GPT builds upon the OpenFlamingo open-source multimodal model. It enhances performance by jointly training the model on a diverse set of visual instruction datasets (VQA, image captioning, visual reasoning, OCR, visual dialogue) and language-only instruction data. This approach allows the model to learn complementary visual and linguistic cues, leading to improved multimodal understanding and generation.
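To make the joint-training idea concrete, the following is a minimal, illustrative Python sketch of mixing visual-instruction samples (e.g., VQA or captioning) with language-only instruction samples under a single prompt template. The template, the field names (question, answer, instruction, output), and the mixed_batches helper are assumptions for illustration only, not the repository's actual data pipeline.

```python
# Illustrative sketch (not the repository's code): interleaving visual-instruction
# and language-only samples into one instruction-following format.
import random

# Assumed instruction template; Multimodal-GPT's real template may differ.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

def format_visual_sample(sample):
    """A VQA/captioning-style sample: the image is handled by the vision
    encoder; the text side reuses the same instruction template."""
    text = PROMPT_TEMPLATE.format(
        instruction=sample["question"], response=sample["answer"]
    )
    return {"image": sample["image_path"], "text": text}

def format_language_sample(sample):
    """A language-only instruction sample; no image is attached."""
    text = PROMPT_TEMPLATE.format(
        instruction=sample["instruction"], response=sample["output"]
    )
    return {"image": None, "text": text}

def mixed_batches(visual_data, language_data, batch_size=4, seed=0):
    """Yield shuffled batches drawn from both sources, so every training step
    can see complementary visual and linguistic supervision."""
    rng = random.Random(seed)
    pool = [format_visual_sample(s) for s in visual_data]
    pool += [format_language_sample(s) for s in language_data]
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i : i + batch_size]

if __name__ == "__main__":
    visual = [{"image_path": "cat.jpg", "question": "What animal is shown?", "answer": "A cat."}]
    language = [{"instruction": "Name one primary color.", "output": "Red."}]
    for batch in mixed_batches(visual, language, batch_size=2):
        print(batch)
```

In the actual model, images are encoded by OpenFlamingo's vision encoder and interleaved with the text tokens; the sketch only illustrates that both data sources can share one instruction format during joint training.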
Quick Start & Requirements
Clone the repository, cd into it, and run pip install -r requirements.txt followed by pip install -v -e . to install the package in editable mode. Alternatively, use the provided environment.yml to create a Conda environment. The demo is launched with python app.py after setting up the checkpoint directories. Fine-tuning runs as a multi-GPU distributed launch (e.g., torchrun --nproc_per_node=8), and the relevant datasets must be downloaded separately.
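Because the demo depends on several pre-trained weights being in place, a small pre-flight check like the sketch below can save a failed launch. The directory and file names here are hypothetical placeholders; the repository's README defines the exact checkpoint layout that app.py expects.

```python
# Hypothetical pre-flight check before running the demo (python app.py).
# The paths below are placeholders, not the repository's required layout.
from pathlib import Path

EXPECTED = [
    Path("checkpoints/llama-7b"),       # assumed LLaMA weights location
    Path("checkpoints/openflamingo"),   # assumed OpenFlamingo weights location
    Path("checkpoints/mmgpt-lora.pt"),  # assumed fine-tuned LoRA weights
]

missing = [p for p in EXPECTED if not p.exists()]
if missing:
    print("Missing checkpoints:", ", ".join(str(p) for p in missing))
else:
    print("All expected checkpoints found; launch the demo with: python app.py")
```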
Maintenance & Community
The project is part of the OpenMMLab ecosystem, known for its active development in computer vision. Acknowledgements include contributions from OpenFlamingo, LAVIS, and other prominent open-source projects.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, its reliance on LLaMA and OpenFlamingo suggests potential licensing considerations from those base models. Compatibility for commercial use would require careful review of all underlying component licenses.
Limitations & Caveats
The setup for running the demo and fine-tuning requires downloading and organizing multiple large pre-trained models and datasets, which can be time-consuming and resource-intensive. The fine-tuning process requires a substantial number of GPUs.