tokenize-anything by baaivision

Vision model for segmenting, recognizing, and captioning arbitrary regions

Created 1 year ago
596 stars

Top 54.6% on SourcePulse

Project Summary

This project introduces Tokenize Anything via Prompting (TAP), a unified model for segmenting, recognizing, and captioning arbitrary image regions using flexible visual prompts. It targets researchers and developers working on multimodal vision-language tasks, offering a versatile tool for detailed image analysis.

How It Works

TAP employs a modular design whose components are decoupled for flexibility. It leverages a large-scale pre-trained EVA-CLIP model (5 billion parameters) for semantic knowledge and is trained on the exhaustive segmentation masks of SA-1B. The result is a single model that simultaneously segments, recognizes, and captions a user-specified region prompted with points, boxes, or sketches.
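
As a concrete illustration, below is a minimal loading sketch in the registry style the repo's README describes. The registry key tap_vit_l, the checkpoint and concept-weight filenames, and the concept_projector attribute are assumptions taken from the README at time of writing; verify them against the Inference and Concept Guides.

```python
import numpy as np
from PIL import Image

# Registry-style loading as described in the repo's README. The key
# "tap_vit_l", the checkpoint filename, and the concept-weight path
# are assumptions to verify against the Inference and Concept Guides.
from tokenize_anything import model_registry

model = model_registry["tap_vit_l"](checkpoint="models/tap_vit_l_v1_1.pkl")

# Concept weights let the predicted semantic token be mapped to a
# category name by the recognition head (assumed filename).
model.concept_projector.reset_weights("concepts/merged_2560.pkl")

image = np.array(Image.open("example.jpg").convert("RGB"))
# Prompting `model` on `image` with a point, box, or sketch then yields
# a mask, a concept prediction, and a caption for that region in one pass.
```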

Quick Start & Requirements

  • Installation: pip install git+ssh://git@github.com/baaivision/tokenize-anything.git or clone and pip install .
  • Prerequisites: torch >= 2.1, flash-attn >= 2.3.3 (for TextGeneration). gradio-image-prompter is required for the Gradio app. A quick environment check is sketched after this list.
  • Resources: Three model versions are available (ViT-H, ViT-L, ViT-B) with varying performance and resource needs. Weights are available via Hugging Face.
  • Links: Paper, Demo, Inference Guide, Concept Guide
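
The following snippet, a minimal sketch using only facts from the bullets above, checks the two stated prerequisites before you install the model; the import names torch and flash_attn are the usual PyPI ones, but verify against your environment.

```python
# Quick environment check for the stated prerequisites: torch >= 2.1,
# plus flash-attn >= 2.3.3 if you want text generation (captioning).
import importlib.util

import torch

major_minor = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert major_minor >= (2, 1), f"torch >= 2.1 required, found {torch.__version__}"

if importlib.util.find_spec("flash_attn") is None:
    print("flash-attn not installed: segmentation and recognition still run, "
          "but TextGeneration requires flash-attn >= 2.3.3")
```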

Highlighted Details

  • Unified model for segmentation, recognition, and captioning.
  • Supports flexible visual prompts: points, boxes, and sketches (encoding sketched after this list).
  • Trained on SA-1B dataset and uses a 5B parameter EVA-CLIP model.
  • Offers multiple model sizes (ViT-H, ViT-L, ViT-B) for different use cases.
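
To show what "flexible visual prompts" can look like in practice, here is a hypothetical encoding sketch assuming a SAM-style layout of (x, y, label) triples; the label values are illustrative assumptions, not the repo's documented scheme.

```python
import numpy as np

# Hypothetical SAM-style prompt encoding: a batch of (x, y, label)
# triples per prompt. Label values here are illustrative assumptions;
# the repo's Inference Guide documents the real scheme.
point_prompt = np.array([[[480.0, 320.0, 1.0]]], dtype="float32")  # one foreground click
box_prompt = np.array(
    [[[400.0, 250.0, 2.0],    # assumed top-left corner marker
      [560.0, 390.0, 3.0]]],  # assumed bottom-right corner marker
    dtype="float32",
)
# A sketch prompt would reuse the same layout with more points.
print(point_prompt.shape, box_prompt.shape)  # (1, 1, 3) (1, 2, 3)
```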

Maintenance & Community

The project is associated with BAAI and ICT-CAS. Acknowledgements mention contributions from SAM, EVA, LLaMA, FlashAttention, Gradio, and Detectron2.

Licensing & Compatibility

  • License: Apache License 2.0.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The README notes that the V1.0 and V1.1 releases differ in training schedule and dataset usage, so expect performance differences and possible breaking changes between versions; checkpoints may not be interchangeable.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

  • lens by ContextualAI: Vision-language research paper using LLMs. 354 stars; created 2 years ago, updated 3 months ago.