ShareGPT4V by ShareGPT4Omni

Improving large multi-modal models with enhanced image-text understanding

Created 1 year ago
253 stars

Top 99.3% on SourcePulse

Summary

ShareGPT4V addresses the challenge of improving Large Multi-modal Models (LMMs) by introducing a large-scale, highly descriptive image-text dataset and a novel training methodology. Aimed at AI researchers and engineers, the project delivers a superior LMM and a capable general-purpose image captioner with enhanced visual understanding.

How It Works

The project leverages the ShareGPT4V dataset, comprising over 1.2 million high-quality captions, including 100K generated by GPT-4-Vision. Training proceeds in two stages: feature alignment, which aligns the visual and textual modalities, followed by visual instruction tuning, which equips the model with multimodal instruction-following capabilities. This recipe yields a superior LMM, ShareGPT4V-7B, and a general image captioner, ShareCaptioner, that approaches GPT-4-Vision's captioning performance.
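The two-stage recipe can be sketched as a freezing schedule over the model's components. This is a minimal illustration assuming a LLaVA-style architecture (vision encoder → projector → language model); the component names and the exact unfreezing schedule used by ShareGPT4V may differ.

```python
# Illustrative sketch of a two-stage LMM training schedule, assuming a
# LLaVA-style architecture. Names are hypothetical, not the repo's API.

MODULES = ("vision_encoder", "projector", "language_model")

def trainable_modules(stage: str) -> set[str]:
    """Return which components receive gradient updates in each stage."""
    schedule = {
        # Stage 1: feature alignment -- only the projector learns to map
        # visual features into the language model's embedding space.
        "feature_alignment": {"projector"},
        # Stage 2: visual instruction tuning -- projector and LLM are
        # trained jointly on multimodal instruction-following data.
        "instruction_tuning": {"projector", "language_model"},
    }
    return schedule[stage]
```

The design choice behind the split: stage 1 cheaply aligns modalities without disturbing the pretrained LLM, while stage 2 spends the expensive compute on instruction-following behavior.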

Quick Start & Requirements

Installation requires cloning the repository, setting up a Conda environment with Python 3.10, and installing the project with its training dependencies (pip install -e . and pip install -e ".[train]"). GPU acceleration with CUDA is a prerequisite, as the reference training setup uses A100 GPUs. Pre-trained models and demos are available on HuggingFace and OpenXLab.
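The steps above amount to a short setup script. The repository URL and environment name below are assumptions inferred from the project and author names, and may differ from the actual repo.

```shell
# Setup sketch based on the Quick Start description; URL and env name assumed.
git clone https://github.com/ShareGPT4Omni/ShareGPT4V.git
cd ShareGPT4V

conda create -n share4v python=3.10 -y
conda activate share4v

pip install -e .            # base package
pip install -e ".[train]"   # extra training-time dependencies
```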

Highlighted Details

  • ShareGPT4V Dataset: A large-scale, highly descriptive image-text dataset featuring 1.2M high-quality captions.
  • ShareCaptioner: A general image captioner achieving capabilities close to GPT-4-Vision.
  • ShareGPT4V-7B Model: A superior LMM demonstrating strong performance across various benchmarks like LLaVA-Bench-Wild, MME, and MMBench.
  • ECCV 2024 Publication: The work is accepted for presentation at ECCV 2024.

Maintenance & Community

The project is associated with authors from the University of Science and Technology of China and Shanghai AI Laboratory. While specific community channels like Discord/Slack are not listed, demos and model checkpoints are available on HuggingFace and OpenXLab, facilitating interaction and adoption.

Licensing & Compatibility

The data and checkpoints are licensed strictly for research purposes only. Usage is further restricted by the licenses of base models like LLaMA and Vicuna. The ShareGPT4V dataset itself is licensed under CC BY-NC 4.0, permitting only non-commercial use. Consequently, models trained on this dataset are also restricted to research applications and are not compatible with commercial use.

Limitations & Caveats

The primary limitation is the strict non-commercial, research-only usage restriction imposed by the dataset's CC BY-NC 4.0 license and the underlying base model licenses. This prohibits any deployment or use in commercial products or services.

Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
2 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

485 stars
Multimodal model for grounding language models to images
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

471 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago