Image-to-paragraph generator using multimodal AI
This project transforms images into descriptive paragraphs using a pipeline of advanced vision and language models, targeting users who need automated image summarization or rich textual descriptions. It leverages multiple state-of-the-art models to extract detailed information, offering a more comprehensive output than traditional image captioning.
How It Works
The core approach involves a multi-stage process: BLIP2 generates an initial image caption, GRIT provides dense captions for finer details, and Segment Anything (or Semantic-Segment-Anything) identifies region-level semantics. These extracted textual features are then synthesized by ChatGPT (or GPT-4) into a coherent, unique paragraph, potentially improving retrieval accuracy over direct image-text matching.
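The multi-stage pipeline above can be sketched as follows. This is an illustrative outline only: the function names (`blip2_caption`, `grit_dense_captions`, `sam_region_semantics`, `build_llm_prompt`) are hypothetical stubs standing in for the real model calls, not the project's actual API.

```python
# Hypothetical sketch of the image-to-paragraph pipeline. Each stage is a
# stub returning placeholder text; in the real project these would invoke
# BLIP2, GRIT, Segment Anything, and ChatGPT/GPT-4 respectively.

def blip2_caption(image_path: str) -> str:
    """Stage 1: a global caption for the whole image (BLIP2)."""
    return "a dog playing in a park"  # placeholder output

def grit_dense_captions(image_path: str) -> list[str]:
    """Stage 2: dense captions for finer, region-level details (GRIT)."""
    return ["a brown dog", "green grass", "a red ball"]  # placeholder

def sam_region_semantics(image_path: str) -> list[str]:
    """Stage 3: region-level semantic labels (Segment Anything)."""
    return ["dog", "grass", "ball"]  # placeholder

def build_llm_prompt(caption: str, dense: list[str], regions: list[str]) -> str:
    """Stage 4: merge the extracted text features into one prompt that an
    LLM (ChatGPT or GPT-4) would turn into a coherent paragraph."""
    return (
        f"Global caption: {caption}\n"
        f"Dense captions: {', '.join(dense)}\n"
        f"Region semantics: {', '.join(regions)}\n"
        "Write one coherent, descriptive paragraph from the details above."
    )

image = "example.jpg"  # hypothetical input path
prompt = build_llm_prompt(
    blip2_caption(image),
    grit_dense_captions(image),
    sam_region_semantics(image),
)
print(prompt)
```

Synthesizing a paragraph from these intermediate text features, rather than matching the image to text directly, is what the project suggests can improve retrieval accuracy.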
Quick Start & Requirements
pip install -r requirements.txt (details in install.md)
python main.py --image_src [image_path] --out_image_name [out_file_name]
A CUDA-capable GPU is required.
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates in April 2023. Contact is available via email or GitHub issues for suggestions.
Licensing & Compatibility
The README does not explicitly state a license. The project relies on several other models, whose individual licenses should be consulted for compatibility, especially for commercial use.
Limitations & Caveats
The project relies on external APIs such as OpenAI's, which may incur costs. GPT-3.5 is noted to sometimes miss positional information, so GPT-4 is recommended for better results. The project is still under active development, with outstanding items on its "To Do List."