Image2Paragraph by showlab

Image-to-paragraph generator using multimodal AI

Created 2 years ago
820 stars

Top 43.3% on SourcePulse

Project Summary

This project transforms images into descriptive paragraphs using a pipeline of advanced vision and language models, targeting users who need automated image summarization or rich textual descriptions. It leverages multiple state-of-the-art models to extract detailed information, offering a more comprehensive output than traditional image captioning.

How It Works

The core approach involves a multi-stage process: BLIP2 generates an initial image caption, GRIT provides dense captions for finer details, and Segment Anything (or Semantic-Segment-Anything) identifies region-level semantics. These extracted textual features are then synthesized by ChatGPT (or GPT-4) into a coherent, unique paragraph, potentially improving retrieval accuracy over direct image-text matching.
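The synthesis step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the three vision models have already produced their text outputs, and `ImageFacts`, `build_prompt`, and the prompt wording are all hypothetical names chosen for this example. The resulting string is what would be sent to the language model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (x1, y1, x2, y2) in pixels; the box format is an assumption for this sketch.
Box = Tuple[int, int, int, int]


@dataclass
class ImageFacts:
    """Text extracted from one image by the upstream vision models."""
    caption: str                                   # global caption (e.g. from BLIP2)
    dense_captions: List[Tuple[str, Box]] = field(default_factory=list)    # e.g. from GRIT
    region_semantics: List[Tuple[str, Box]] = field(default_factory=list)  # e.g. from Segment Anything


def build_prompt(facts: ImageFacts, width: int, height: int) -> str:
    """Merge all extracted text into one prompt for a language model."""
    lines = [
        f"Image size: {width}x{height}.",
        f"Overall caption: {facts.caption}",
        "Dense captions (text: box):",
    ]
    lines += [f"- {text}: {list(box)}" for text, box in facts.dense_captions]
    lines.append("Region semantics (label: box):")
    lines += [f"- {label}: {list(box)}" for label, box in facts.region_semantics]
    lines.append(
        "Write one coherent paragraph describing the image. "
        "Use only the facts listed above."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    example = ImageFacts(
        caption="a dog on grass",
        dense_captions=[("brown dog", (10, 20, 110, 220))],
        region_semantics=[("grass", (0, 100, 640, 480))],
    )
    print(build_prompt(example, 640, 480))
```

Passing the box coordinates along with each phrase gives the language model the positional information it needs to describe where objects sit in the image.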

Quick Start & Requirements

  • Install via pip install -r requirements.txt (details in install.md).
  • Run with python main.py --image_src [image_path] --out_image_name [out_file_name].
  • Requires an OpenAI API key.
  • Falls back to CPU inference on machines with a low-memory GPU (<8GB) or no GPU.
  • For faster inference on GPUs with more than 15GB of memory, set the devices to cuda.
  • Demo available on Huggingface.
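The run command and the memory guidance above can be sketched as a small helper. This is an illustration under stated assumptions: `build_command` and `pick_device` are hypothetical names, only the two flags shown above are used, and the choice of cpu for GPUs between 8GB and 15GB is a conservative guess, since that range is not covered here. The option that actually sets the device is not given in this summary, so the device string is only returned, not passed to the command.

```python
import shlex
from typing import List, Optional


def build_command(image_src: str, out_image_name: str) -> List[str]:
    """Argument list for the documented invocation of main.py."""
    return [
        "python", "main.py",
        "--image_src", image_src,
        "--out_image_name", out_image_name,
    ]


def pick_device(gpu_memory_gb: Optional[float]) -> str:
    """Suggest a device string from available GPU memory (None means no GPU).

    Assumption: anything at or below 15GB is treated as cpu, which is
    conservative for the 8GB to 15GB range.
    """
    if gpu_memory_gb is not None and gpu_memory_gb > 15:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    cmd = build_command("examples/my photo.jpg", "output.jpg")
    # shlex.join quotes arguments that contain spaces, so the line is safe to paste.
    print(shlex.join(cmd))
    print(pick_device(24.0))
```

Building the command as a list rather than a single string avoids quoting problems when an image path contains spaces; the list can be handed directly to `subprocess.run`.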

Highlighted Details

  • Achieves <20s inference time on an 8GB GPU.
  • Reports improved retrieval results (IR@1: 49.7, TR@1: 36.1) when images are compressed into paragraphs, compared to direct image-text retrieval.
  • Integrates models like BLIP2, GRIT, Segment Anything, and ControlNet.
  • Supports GPT-4 for enhanced positional accuracy in generated text.
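For readers unfamiliar with the retrieval metrics quoted above, IR@1 and TR@1 are Recall@1 scores for image retrieval and text retrieval. The sketch below shows how such scores are commonly computed from a similarity matrix; it assumes one matching text per image (image i pairs with text i) and says nothing about how the project itself produces the similarities. The function names and the toy matrix are made up for this example.

```python
from typing import List, Sequence


def recall_at_1(scores: Sequence[Sequence[float]]) -> float:
    """Percentage of queries (rows) whose best-scoring candidate (column)
    has the same index as the query, i.e. is the ground-truth match."""
    hits = 0
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += int(best == i)
    return 100.0 * hits / len(scores)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap rows and columns so the other modality becomes the query."""
    return [list(col) for col in zip(*matrix)]


if __name__ == "__main__":
    # sim[i][j]: similarity between image i and text j (toy values).
    sim = [
        [0.9, 0.1, 0.2],
        [0.3, 0.8, 0.7],
        [0.2, 0.9, 0.4],
    ]
    tr_at_1 = recall_at_1(sim)             # image as query, retrieve text
    ir_at_1 = recall_at_1(transpose(sim))  # text as query, retrieve image
    print(f"TR@1: {tr_at_1:.1f}  IR@1: {ir_at_1:.1f}")
```

In the paragraph-based setting, each image would be represented by its generated paragraph, so the similarity matrix would come from text-to-text matching rather than image-to-text matching.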

Maintenance & Community

The most recent updates noted in the README date from April 2023. Suggestions can be sent by email or raised as GitHub issues.

Licensing & Compatibility

The README does not explicitly state a license. The project relies on several other models, whose individual licenses should be consulted for compatibility, especially for commercial use.

Limitations & Caveats

The project relies on external APIs such as OpenAI's, which may incur costs. GPT-3.5 is noted to sometimes miss positional information, so GPT-4 is recommended for better results. The README still lists open items on its "To Do List."

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days