Caption-Anything by ttengwang

Image processing tool for generating tailored captions

created 2 years ago
1,755 stars

Top 25.0% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Caption-Anything is an interactive image captioning tool that leverages object segmentation, visual grounding, and large language models (LLMs) to generate descriptive text for specific image regions. It targets users who need fine-grained control over caption generation, offering both visual (e.g., click-based selection) and linguistic (e.g., length, sentiment) parameters for tailored output.
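
As an illustration of the two control surfaces described above, the snippet below sketches what a combined visual + language control might look like; the field names are hypothetical and not the project's actual API:

    # Hypothetical illustration of the two kinds of controls; key names are
    # assumptions, not Caption-Anything's actual interface.
    visual_control = {
        "click_point": (420, 310),   # (x, y) pixel the user clicked on
        "click_label": 1,            # 1 = foreground, 0 = background
    }
    language_control = {
        "length": "short",           # brief vs. detailed caption
        "sentiment": "positive",     # tone of the generated text
        "language": "English",       # output language
    }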

How It Works

The system integrates Segment Anything (SAM) for precise object segmentation, a visual captioning model (like BLIP-2) to generate initial descriptions, and ChatGPT for conversational refinement and style control. Users can interactively select objects via mouse clicks, and then prompt the LLM to modify captions based on desired sentiment, length, or language, enabling detailed object-specific discussions.
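
A minimal sketch of that three-stage pipeline, assuming the public segment-anything and Hugging Face transformers APIs plus the OpenAI chat endpoint; the checkpoint path, model names, click coordinates, mask handling, and refinement prompt are placeholder assumptions, not the project's exact code:

    import numpy as np
    import torch
    from PIL import Image
    from openai import OpenAI
    from segment_anything import SamPredictor, sam_model_registry
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    image = np.array(Image.open("example.jpg").convert("RGB"))

    # 1) Segment the clicked object with SAM (base "vit_b" checkpoint assumed).
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").to(device)
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    masks, _, _ = predictor.predict(
        point_coords=np.array([[420, 310]]),  # the user's click, in pixels
        point_labels=np.array([1]),           # 1 marks a foreground point
        multimask_output=False,
    )

    # 2) Caption the selected region with BLIP-2 (cropped to the mask's bounding
    #    box here for simplicity; the project may handle the mask differently).
    ys, xs = np.where(masks[0])
    crop = Image.fromarray(image[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    captioner = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
    ).to(device)
    inputs = processor(images=crop, return_tensors="pt").to(device, torch.float16)
    raw_caption = processor.decode(
        captioner.generate(**inputs)[0], skip_special_tokens=True
    )

    # 3) Refine the raw caption with an LLM according to the language controls.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite this caption as one short, positive English sentence: {raw_caption}",
        }],
    )
    print(reply.choices[0].message.content)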

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python >= 3.8.1.
  • Needs an OpenAI API key.
  • GPU memory requirements vary: huge segmenter + blip2 captioner requires ~13 GB; base segmenter + blip2 requires ~8.5 GB; base segmenter + blip requires ~5.5 GB (see the sketch after this list).
  • Pre-downloaded SAM checkpoints are recommended for specific configurations.
  • Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
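
As a rough illustration of those memory tiers, this hypothetical helper picks a segmenter/captioner pair from the total GPU memory on the current device; the thresholds mirror the figures above, and the returned names are assumptions, not actual CLI flags:

    import torch

    def pick_config():
        """Map available GPU memory to the segmenter/captioner tiers listed above.
        The (segmenter, captioner) names are illustrative assumptions only."""
        if not torch.cuda.is_available():
            return ("base", "blip")  # CPU fallback; expect slow inference
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        if total_gb >= 13:
            return ("huge", "blip2")   # ~13 GB tier
        if total_gb >= 8.5:
            return ("base", "blip2")   # ~8.5 GB tier
        return ("base", "blip")        # ~5.5 GB tier

    print(pick_config())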

Highlighted Details

  • Interactive object selection via mouse clicks.
  • Language controls for caption length, sentiment, and language.
  • Supports "chatting" about selected objects for deeper understanding.
  • Integrates with LangChain and Visual Question Answering (VQA).
  • Recent updates include support for captioning paragraphs and mouse trajectory as a visual control.

Maintenance & Community

The project is primarily driven by Teng Wang and contributors from various institutions including Southern University of Science and Technology, HKU, and Tencent ARC Lab.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The project relies on an external API (OpenAI), which may incur usage costs. GPU memory requirements can be substantial for the higher-fidelity model configurations, and the mouse-trajectory control is noted as beta.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

22 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

352 stars
Vision-language research paper using LLMs
created 2 years ago
updated 1 week ago