baaivision: Vision model for segmenting, recognizing, and captioning arbitrary regions
Top 54.6% on SourcePulse
This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a versatile tool for detailed image analysis.
How It Works
TAP employs a modular design, decoupling components for flexibility. It leverages a large-scale pre-trained EVA-CLIP model (5 billion parameters) and is trained on exhaustive segmentation masks from SA-1B. This approach allows for simultaneous segmentation, recognition, and captioning of user-specified regions via points, boxes, or sketches.
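The prompt types listed above (points, boxes, sketches) are typically flattened into one uniform token sequence before being fed to a promptable model. The sketch below is hypothetical, not the TAP API: the encode_prompts helper and its label convention (1 = point, 2/3 = box corners, 4 = sketch vertex, loosely modeled on SAM-style prompt encoding) are assumptions for illustration.

```python
# Hypothetical sketch (NOT the TAP API): normalize mixed visual prompts
# -- points, boxes, and free-form sketches -- into a single list of
# (x, y, label) triples, a common pattern in promptable segmentation.

def encode_prompts(points=None, boxes=None, sketches=None):
    """Flatten mixed visual prompts into (x, y, label) triples.

    Label convention (an assumption here, SAM-like):
    1 = foreground point, 2 = box top-left, 3 = box bottom-right,
    4 = sketch vertex.
    """
    tokens = []
    for x, y in points or []:
        tokens.append((float(x), float(y), 1))
    for x0, y0, x1, y1 in boxes or []:
        tokens.append((float(x0), float(y0), 2))  # top-left corner
        tokens.append((float(x1), float(y1), 3))  # bottom-right corner
    for sketch in sketches or []:
        for x, y in sketch:
            tokens.append((float(x), float(y), 4))
    return tokens

# Example: one click plus one box on a 640x480 image.
prompt = encode_prompts(points=[(320, 240)], boxes=[(100, 80, 400, 360)])
print(prompt)  # [(320.0, 240.0, 1), (100.0, 80.0, 2), (400.0, 360.0, 3)]
```

A uniform triple format like this is what lets a single prompt encoder handle all three input modes with shared weights.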
Quick Start & Requirements
Install: pip install git+ssh://git@github.com/baaivision/tokenize-anything.git, or clone the repository and run pip install .
Requirements: torch >= 2.1; flash-attn >= 2.3.3 (for TextGeneration).
gradio-image-prompter is required for the Gradio app.
Highlighted Details
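The version minimums stated above (torch >= 2.1, flash-attn >= 2.3.3) can be checked before attempting text generation. This is a minimal, self-contained sketch; the meets_minimum helper is an assumption for illustration, and the package names and minimum versions come from the requirements listed here.

```python
# Minimal sketch: numeric comparison of dotted version strings against
# the project's stated minimums (torch >= 2.1, flash-attn >= 2.3.3).
# Plain tuple comparison avoids string ordering bugs like "2.10" < "2.3".

def meets_minimum(installed: str, required: str) -> bool:
    """Return True if the installed version satisfies the minimum."""
    def parse(v):
        return tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(required)

print(meets_minimum("2.1.0", "2.1"))    # True
print(meets_minimum("2.3.3", "2.3.3"))  # True
print(meets_minimum("2.0.1", "2.1"))    # False
```

In practice you would feed this the values of torch.__version__ and the installed flash-attn version and fail fast with a clear error if either check returns False.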
Maintenance & Community
The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.
Licensing & Compatibility
Limitations & Caveats
The README mentions different training schedules and dataset usage between V1.0 and V1.1 model releases, suggesting potential performance differences or breaking changes between versions.
Last commit: 10 months ago; the repository is currently marked Inactive.