ComfyUI-Florence2 by kijai

ComfyUI nodes for Florence2 vision-language model inference

created 1 year ago
1,353 stars

Top 30.3% on sourcepulse

Project Summary

This repository provides ComfyUI custom nodes for running Microsoft's Florence-2 vision-language model (VLM). It lets users perform a range of vision tasks, including object detection, segmentation, captioning, and notably Document Visual Question Answering (DocVQA), by leveraging Florence-2's prompt-based approach and sequence-to-sequence architecture.

How It Works

Florence-2 is a powerful VLM trained on the extensive FLD-5B dataset, which allows it to handle diverse vision-language tasks through simple text prompts. These nodes integrate Florence-2 into ComfyUI workflows, enabling tasks like DocVQA: users ask questions about a document image and receive answers grounded in both its visual layout and its embedded text.
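For readers outside ComfyUI, the underlying prompt-driven flow can be sketched with Hugging Face transformers. This is a minimal illustration, not the node's actual code: the model ID, the <DocVQA> task token (used by DocVQA finetunes), the question, and the image path are assumptions.

```python
# Minimal sketch of Florence-2's prompt-based inference with Hugging Face
# transformers. Model ID, task token, question, and image path are
# illustrative assumptions; the ComfyUI nodes wrap a flow like this.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
model_id = "microsoft/Florence-2-large"  # assumption: any supported checkpoint

# Florence-2 ships its modeling code with the checkpoint, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png").convert("RGB")  # hypothetical document image
# The leading task token selects the task; DocVQA finetunes take a question after it.
prompt = "<DocVQA>What is the total amount due?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"].to(device),
    pixel_values=inputs["pixel_values"].to(device, dtype),
    max_new_tokens=256,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation() strips task tokens and parses structured outputs.
answer = processor.post_process_generation(raw, task="<DocVQA>", image_size=image.size)
print(answer)
```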

Quick Start & Requirements
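The README's exact setup steps are not captured in this summary. As with most ComfyUI custom nodes, installation conventionally means cloning the repository into ComfyUI's custom_nodes directory and installing its Python requirements, or installing it through ComfyUI-Manager; Florence-2 checkpoints are typically fetched from Hugging Face, so expect a model download on first run.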

Highlighted Details

  • Adds Document Visual Question Answering (DocVQA) capability to ComfyUI.
  • Supports a wide range of Florence-2 models and tested finetunes.
  • Enables prompt-based vision tasks like captioning, object detection, and segmentation (see the sketch after this list).
  • Integrates seamlessly as a ComfyUI custom node.
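Continuing the DocVQA sketch above (it reuses model, processor, image, device, and dtype from there), other tasks are selected purely by swapping the task token; detection-style tasks come back as strings that the processor parses into coordinates. The <OD> token and the result shape shown in the comment follow Florence-2's published usage examples and are assumptions as far as this summary goes.

```python
# Continues the DocVQA sketch above (model, processor, image, device, dtype).
# Object detection: only the task token changes; no free-form text is needed.
prompt = "<OD>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"].to(device),
    pixel_values=inputs["pixel_values"].to(device, dtype),
    max_new_tokens=1024,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task="<OD>", image_size=image.size)
# Expected shape (assumption, per Florence-2's published examples):
# {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["...", ...]}}
print(parsed)
```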

Maintenance & Community

No specific community links or maintenance details are provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

DocVQA accuracy depends on input image quality and question complexity. The README does not mention specific hardware requirements beyond standard ComfyUI dependencies.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 9
  • Star history: 184 stars in the last 90 days
