Vision model for segmenting, recognizing, and captioning arbitrary regions
This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a versatile tool for detailed image analysis.
How It Works
TAP employs a modular design whose components are decoupled for flexibility: a single visual prompt drives segmentation, recognition, and captioning together. It leverages a large-scale pre-trained EVA-CLIP model (5 billion parameters) and is trained on the exhaustive segmentation masks of SA-1B. Users can therefore segment, recognize, and caption arbitrary regions simultaneously by prompting with points, boxes, or sketches.
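In practice this means one model call per prompted region. The sketch below illustrates that workflow; the model_registry, the tap_vit_l key, the checkpoint path, and the predict helper are illustrative assumptions rather than the project's confirmed API, so consult the repository README for the exact calls.

```python
# Minimal sketch (hypothetical API) of prompt-driven segmentation, recognition,
# and captioning with TAP. Names below are illustrative assumptions.
import numpy as np
from PIL import Image

from tokenize_anything import model_registry  # assumed SAM-style registry

# Load a TAP checkpoint (model key and path are placeholders).
model = model_registry["tap_vit_l"](checkpoint="models/tap_vit_l.pkl")

image = np.array(Image.open("example.jpg").convert("RGB"))

# A single foreground point prompt at pixel (x=320, y=240).
points = np.array([[320, 240]])

# Hypothetical one-call interface returning all three outputs for the prompted region.
mask, category, caption = model.predict(image, points=points)

print(f"{category}: {caption}")
print("mask area (pixels):", int(mask.sum()))
```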
Quick Start & Requirements
Install with pip install git+ssh://git@github.com/baaivision/tokenize-anything.git, or clone the repository and run pip install . from its root.
Requires torch >= 2.1, plus flash-attn >= 2.3.3 for text generation. gradio-image-prompter is required for the Gradio app.
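A small optional check that the stated requirements are met before running the model (the version parsing here is deliberately simplistic):

```python
# Smoke test for the stated requirements: torch >= 2.1 and, for text
# generation, flash-attn >= 2.3.3.
import importlib.util

import torch

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
print("torch:", torch.__version__)
assert (major, minor) >= (2, 1), "torch >= 2.1 is required"

# flash-attn is only needed for captioning (text generation), so warn instead of failing.
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not found: text generation will be unavailable")
else:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
```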
Highlighted Details
Maintenance & Community
The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.
Licensing & Compatibility
Limitations & Caveats
The README mentions different training schedules and dataset usage between V1.0 and V1.1 model releases, suggesting potential performance differences or breaking changes between versions.