Vision model for segmenting, recognizing, and captioning arbitrary regions
This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a versatile tool for detailed image analysis.
How It Works
TAP employs a modular design whose components are decoupled for flexibility: a single visual prompt drives segmentation, recognition, and captioning together. It leverages a large-scale pre-trained EVA-CLIP model (5 billion parameters) and is trained on the exhaustive segmentation masks of SA-1B. Users can therefore segment, recognize, and caption arbitrary regions simultaneously by prompting with points, boxes, or sketches.
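In practice this means one model call per prompted region. The sketch below illustrates that workflow; the model_registry, the tap_vit_l key, the checkpoint path, and the predict helper are illustrative assumptions rather than the project's confirmed API, so consult the repository README for the exact calls.

```python
# Minimal sketch (hypothetical API) of prompt-driven segmentation, recognition,
# and captioning with TAP. Names below are illustrative assumptions.
import numpy as np
from PIL import Image

from tokenize_anything import model_registry  # assumed SAM-style registry

# Load a TAP checkpoint (model key and path are placeholders).
model = model_registry["tap_vit_l"](checkpoint="models/tap_vit_l.pkl")

image = np.array(Image.open("example.jpg").convert("RGB"))

# A single foreground point prompt at pixel (x=320, y=240).
points = np.array([[320, 240]])

# Hypothetical one-call interface returning all three outputs for the prompted region.
mask, category, caption = model.predict(image, points=points)

print(f"{category}: {caption}")
print("mask area (pixels):", int(mask.sum()))
```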
Quick Start & Requirements
Install with pip install git+ssh://git@github.com/baaivision/tokenize-anything.git, or clone the repository and run pip install . from its root.
Requires torch >= 2.1, plus flash-attn >= 2.3.3 for text generation. gradio-image-prompter is required for the Gradio app.
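A small optional check that the stated requirements are met before running the model (the version parsing here is deliberately simplistic):

```python
# Smoke test for the stated requirements: torch >= 2.1 and, for text
# generation, flash-attn >= 2.3.3.
import importlib.util

import torch

major, minor = (int(x) for x in torch.__version__.split(".")[:2])
print("torch:", torch.__version__)
assert (major, minor) >= (2, 1), "torch >= 2.1 is required"

# flash-attn is only needed for captioning (text generation), so warn instead of failing.
if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not found: text generation will be unavailable")
else:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
```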
Highlighted Details
Maintenance & Community
The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.
Licensing & Compatibility
Limitations & Caveats
The README mentions different training schedules and dataset usage between V1.0 and V1.1 model releases, suggesting potential performance differences or breaking changes between versions.