tokenize-anything by baaivision

Vision model for segmenting, recognizing, and captioning arbitrary regions

created 1 year ago
587 stars

Top 56.1% on sourcepulse

View on GitHub
Project Summary

This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a single promptable model that localizes, classifies, and describes regions rather than separate task-specific pipelines.

How It Works

TAP employs a modular design: an image encoder, a prompt encoder, and a decoder are decoupled, and the decoder predicts both a segmentation mask and a semantic token for each prompted region. Semantic priors come from a large-scale pre-trained EVA-CLIP model (5 billion parameters), and training uses the exhaustive segmentation masks of SA-1B. The result is simultaneous segmentation, recognition, and captioning of user-specified regions via points, boxes, or sketches, as illustrated below.
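
To make the decoupled flow concrete, here is an illustrative Python sketch of the three stages. Every method name in it (`encode_image`, `encode_prompt`, `decode_region`, `recognize`, `caption`) is hypothetical shorthand for the pipeline described above, not the project's actual API; see the Inference Guide for the real entry points.

```python
# Illustrative pseudocode for TAP's decoupled pipeline. All method names
# here are hypothetical, not the library's real API.
import numpy as np

def tokenize_region(model, image: np.ndarray, point_xy: tuple[int, int]):
    # 1) Encode the image once; the embedding is reused across prompts.
    image_embedding = model.encode_image(image)

    # 2) Encode the visual prompt (a point here; boxes and sketches
    #    follow the same path).
    prompt_embedding = model.encode_prompt(points=[point_xy])

    # 3) The decoder emits a mask plus a semantic token; the token is
    #    reused by the recognition and captioning heads.
    mask, semantic_token = model.decode_region(image_embedding, prompt_embedding)

    label = model.recognize(semantic_token)   # open-vocabulary concept (EVA-CLIP space)
    caption = model.caption(semantic_token)   # region-level caption
    return mask, label, caption
```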

Quick Start & Requirements

  • Installation: pip install git+ssh://git@github.com/baaivision/tokenize-anything.git or clone and pip install .
  • Prerequisites: torch >= 2.1; flash-attn >= 2.3.3 (required for text generation). gradio-image-prompter is required for the Gradio app.
  • Resources: Three model versions are available (ViT-H, ViT-L, ViT-B) with varying performance and resource needs. Weights are available via Hugging Face; a minimal loading sketch follows this list.
  • Links: Paper, Demo, Inference Guide, Concept Guide
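
Below is a minimal loading sketch. It assumes the `model_registry` entry point described in the project's Inference Guide; the registry key and file paths are assumptions and should be matched to the weights you download from Hugging Face.

```python
# Minimal loading sketch. The registry key ("tap_vit_l") and checkpoint
# path are assumptions; match them to the weights from Hugging Face.
from tokenize_anything import model_registry

model = model_registry["tap_vit_l"](checkpoint="models/tap_vit_l_v1_1.pkl")
# From here, follow the Inference Guide to embed an image, supply point or
# box prompts, and read back masks, concept predictions, and captions.
```

The ViT-H and ViT-B variants follow the same pattern; only the registry key and checkpoint file change.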

Highlighted Details

  • Unified model for segmentation, recognition, and captioning.
  • Supports flexible visual prompts: points, boxes, and sketches.
  • Trained on the SA-1B dataset and leverages a 5B-parameter EVA-CLIP model.
  • Offers multiple model sizes (ViT-H, ViT-L, ViT-B) for different use cases.

Maintenance & Community

The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README mentions different training schedules and dataset usage between the V1.0 and V1.1 model releases, suggesting potential performance differences or breaking changes between versions.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
