ComfyUI-Florence2 by kijai

ComfyUI nodes for Florence2 vision-language model inference

Created 1 year ago
1,434 stars

Top 28.4% on SourcePulse

Project Summary

This repository provides ComfyUI custom nodes for running Microsoft's Florence-2 vision-language model (VLM). The nodes cover vision tasks such as object detection, segmentation, captioning, and Document Visual Question Answering (DocVQA), all driven by Florence-2's prompt-based, sequence-to-sequence design.

How It Works

Florence-2 is a VLM trained on the FLD-5B dataset, and it handles diverse vision-language tasks through short text prompts. The nodes expose the model inside a ComfyUI workflow. For DocVQA, the user supplies a document image and a question, and the answer is derived from the image's visual and textual content.
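
To make the prompt-based approach concrete, here is a minimal sketch of how a task name and an optional question might be turned into a Florence-2 prompt string. It is not code from this repository. The task tokens follow the public Florence-2 model card, the `<DocVQA>` token is an assumption based on community DocVQA finetunes, and the names `TASK_TOKENS`, `TASKS_WITH_TEXT`, and `build_prompt` are hypothetical.

```python
# Hypothetical mapping from task names to Florence-2 task tokens.
# Tokens follow the public Florence-2 model card; "<DocVQA>" is an
# assumption based on community DocVQA finetunes.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "docvqa": "<DocVQA>",
}

# Tasks whose prompt carries free text (a phrase or question) after the token.
TASKS_WITH_TEXT = {"referring_expression_segmentation", "docvqa"}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the prompt string for a task, appending text where the task needs it."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task!r}")
    token = TASK_TOKENS[task]
    text = text_input.strip()
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"task {task!r} requires text input")
        return token + text
    if text:
        raise ValueError(f"task {task!r} does not take text input")
    return token


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("docvqa", "What is the invoice total?"))
```

In a real pipeline the resulting string would be passed to the model's processor together with the image. The sketch stops at the prompt so that it stays runnable without model weights.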

Quick Start & Requirements

Highlighted Details

  • Adds Document Visual Question Answering (DocVQA) capability to ComfyUI.
  • Supports a wide range of Florence-2 models and tested finetunes.
  • Enables prompt-based vision tasks like captioning, object detection, and segmentation.
  • Integrates seamlessly as a ComfyUI custom node.

Maintenance & Community

No specific community links or maintenance details are provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

Accuracy for DocVQA is dependent on input image quality and question complexity. The README does not mention specific hardware requirements beyond standard ComfyUI dependencies.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 47 stars in the last 30 days
