Image2Paragraph by showlab

Image-to-paragraph generator using multimodal AI

created 2 years ago
815 stars

Top 44.4% on sourcepulse

Project Summary

This project turns an image into a descriptive paragraph by chaining several vision and language models. It targets users who need automated image summarization or rich textual descriptions, and it combines image-level, dense, and region-level information to produce more detail than a traditional single-sentence image caption.

How It Works

The core approach involves a multi-stage process: BLIP2 generates an initial image caption, GRIT provides dense captions for finer details, and Segment Anything (or Semantic-Segment-Anything) identifies region-level semantics. These extracted textual features are then synthesized by ChatGPT (or GPT-4) into a coherent, unique paragraph, potentially improving retrieval accuracy over direct image-text matching.
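
As a rough illustration of the final synthesis step, the sketch below gathers the three kinds of extracted text into one prompt for the language model. The `Region` type, the `build_prompt` function, the prompt wording, and the sample values are all hypothetical; the project's actual prompt and data structures may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    """A piece of text tied to an image region (illustrative type only)."""
    label: str
    box: Box


def format_regions(regions: List[Region]) -> str:
    """Render regions as 'label: [x1, y1, x2, y2]' entries separated by '; '."""
    return "; ".join(f"{r.label}: {list(r.box)}" for r in regions)


def build_prompt(caption: str, dense: List[Region],
                 semantics: List[Region], width: int, height: int) -> str:
    """Combine the caption, dense captions, and region semantics into one prompt."""
    lines = [
        f"Image size: {width}x{height}",
        f"Caption: {caption}",
        f"Dense captions: {format_regions(dense)}",
        f"Region semantics: {format_regions(semantics)}",
        "Write one coherent paragraph describing the image "
        "using only the facts above.",
    ]
    return "\n".join(lines)


# Made-up sample values standing in for the three model outputs.
prompt = build_prompt(
    caption="a dog on grass",
    dense=[Region("brown dog running", (40, 60, 300, 380))],
    semantics=[Region("grass", (0, 200, 640, 480))],
    width=640,
    height=480,
)
print(prompt)
```

Including the box coordinates in the prompt is one way to give the language model the positional information it needs; the resulting string would then be sent to ChatGPT or GPT-4.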

Quick Start & Requirements

  • Install via pip install -r requirements.txt (details in install.md).
  • Run with python main.py --image_src [image_path] --out_image_name [out_file_name].
  • Requires an OpenAI API key.
  • Can run inference on CPU when no GPU is available or the GPU has less than 8GB of memory.
  • For faster inference on GPUs with more than 15GB of memory, set the devices to cuda.
  • Demo available on Huggingface.
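
The run command above can also be assembled from a script. The sketch below builds the argument list with the two flags shown in the quick start and checks that an API key is present. The environment variable name `OPENAI_API_KEY`, the helper names, and the sample image path are assumptions; how the project actually reads the key is described in its own install notes.

```python
import os
import shlex
from typing import List, Mapping


def build_command(image_src: str, out_image_name: str) -> List[str]:
    """Build the argv list for main.py using the flags from the quick start."""
    return [
        "python", "main.py",
        "--image_src", image_src,
        "--out_image_name", out_image_name,
    ]


def has_api_key(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the (assumed) OPENAI_API_KEY variable is set and non-empty."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Hypothetical image path and output name, for illustration only.
cmd = build_command("examples/my photo.jpg", "output")

if not has_api_key():
    print("Warning: OPENAI_API_KEY is not set (assumed variable name).")

# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the pipeline; keeping the arguments as a list avoids quoting problems with paths that contain spaces.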

Highlighted Details

  • Achieves <20s inference time on an 8GB GPU.
  • Reports improved retrieval results (IR@1: 49.7, TR@1: 36.1) when images are first compressed into paragraphs, compared with direct image-text retrieval.
  • Integrates models like BLIP2, GRIT, Segment Anything, and ControlNet.
  • Supports GPT-4 for enhanced positional accuracy in generated text.
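
For readers unfamiliar with the retrieval metrics quoted above, the sketch below shows how recall at rank 1 is commonly computed from a similarity matrix. It assumes a one-to-one pairing in which the i-th text matches the i-th image, with IR@1 measured from text to image and TR@1 from image to text. The project's actual evaluation protocol is not described here, and the scores in the example are made up.

```python
from typing import List, Sequence


def recall_at_1(sim: Sequence[Sequence[float]]) -> float:
    """Percentage of rows whose highest-scoring column is the matching index."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap rows and columns so the query direction is reversed."""
    return [list(col) for col in zip(*sim)]


# Made-up scores: rows are text queries, columns are images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.7, 0.5, 0.6],
]

ir_at_1 = recall_at_1(sim)             # text -> image retrieval
tr_at_1 = recall_at_1(transpose(sim))  # image -> text retrieval
print(f"IR@1: {ir_at_1:.1f}, TR@1: {tr_at_1:.1f}")
```

In the paragraph-based setting, the image side of each score would come from the generated paragraph rather than from the image itself, which turns the task into text-to-text matching.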

Maintenance & Community

The most recent updates listed date from April 2023. Suggestions can be sent by email or raised as GitHub issues.

Licensing & Compatibility

The README does not explicitly state a license. The project relies on several other models, whose individual licenses should be consulted for compatibility, especially for commercial use.

Limitations & Caveats

The project relies on external APIs such as OpenAI's, which may incur costs. GPT-3.5 sometimes misses positional information, so GPT-4 is recommended for better results. Some items on the project's "To Do List" remain open.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
