Caption-Anything by ttengwang

Image processing tool for generating tailored captions

Created 2 years ago
1,761 stars

Top 24.3% on SourcePulse

View on GitHub
Project Summary

Caption-Anything is an interactive image captioning tool that leverages object segmentation, visual grounding, and large language models (LLMs) to generate descriptive text for specific image regions. It targets users who need fine-grained control over caption generation, offering both visual (e.g., click-based selection) and linguistic (e.g., length, sentiment) parameters for tailored output.

How It Works

The system integrates Segment Anything (SAM) for precise object segmentation, a visual captioning model (such as BLIP-2) to generate initial descriptions, and ChatGPT for conversational refinement and style control. Users interactively select objects with mouse clicks, then prompt the LLM to adjust a caption's sentiment, length, or language, enabling detailed object-specific discussion.
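The three-stage flow described above can be sketched as a minimal Python pipeline. The functions below are stubs standing in for the real models (SAM for segmentation, BLIP-2 for captioning, ChatGPT for refinement); the function names, signatures, and sample strings are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of the Caption-Anything pipeline, with each model stubbed out.

def segment(image, click_xy):
    """Stub for SAM: return a mask for the object at the clicked point."""
    return {"mask": "object-mask", "point": click_xy}

def caption(image, mask):
    """Stub for BLIP-2: produce a raw description of the masked region."""
    return "a dog lying on the grass"

def refine(raw_caption, sentiment="positive", length="short", language="en"):
    """Stub for the LLM step: restyle the caption per the user's controls."""
    return f"[{sentiment}/{length}/{language}] {raw_caption}"

image = "photo.jpg"                # placeholder input image
mask = segment(image, (120, 80))   # user clicks an object in the image
raw = caption(image, mask)         # raw caption for the selected region
final = refine(raw, sentiment="positive", length="short")
print(final)  # [positive/short/en] a dog lying on the grass
```

The key design point is that the visual controls (click, mask) and the language controls (sentiment, length, language) are decoupled: segmentation fixes *what* is described, and the LLM step fixes *how* it is described.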

Quick Start & Requirements

  • Install via pip install -r requirements.txt.
  • Requires Python >= 3.8.1.
  • Needs an OpenAI API key.
  • GPU memory requirements vary: huge segmenter + blip2 captioner requires ~13GB; base segmenter + blip2 requires ~8.5GB; base segmenter + blip requires ~5.5GB.
  • Pre-downloaded SAM checkpoints are recommended for specific configurations.
  • Demo: https://huggingface.co/spaces/TencentARC/Caption-Anything
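The setup steps above can be sketched as a shell session. This is a hedged sketch, not verbatim README instructions: the repository URL and the `app.py` entry point are assumptions, while the checkpoint URL is the standard SAM ViT-H ("huge") checkpoint from the segment-anything project.

```shell
# Clone the repository and install dependencies (Python >= 3.8.1)
git clone https://github.com/ttengwang/Caption-Anything.git
cd Caption-Anything
pip install -r requirements.txt

# The refinement step calls the OpenAI API
export OPENAI_API_KEY=sk-...   # your key here

# Optional: pre-download a SAM checkpoint (ViT-H "huge" variant, ~13GB GPU with blip2)
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# Launch the interactive demo (entry point assumed)
python app.py
```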

Highlighted Details

  • Interactive object selection via mouse clicks.
  • Language controls for caption length, sentiment, and language.
  • Supports "chatting" about selected objects for deeper understanding.
  • Integrates with LangChain and Visual Question Answering (VQA).
  • Recent updates include support for captioning paragraphs and mouse trajectory as a visual control.

Maintenance & Community

The project is primarily driven by Teng Wang and contributors from various institutions including Southern University of Science and Technology, HKU, and Tencent ARC Lab.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The project relies on external APIs (OpenAI) which may incur costs. GPU memory requirements can be substantial for higher-fidelity models. The "mouse trajectory" control is noted as beta.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

482 stars
Multimodal model for grounding language models to images
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 1 more.

InternGPT by OpenGVLab

3k stars
Interactive demo platform for showcasing AI models
Created 2 years ago
Updated 1 year ago