ComfyUI nodes for Florence2 vision-language model inference
This repository provides ComfyUI custom nodes for running Microsoft's Florence-2 vision-language model (VLM). It enables users to perform a range of vision tasks, including object detection, segmentation, captioning, and notably Document Visual Question Answering (DocVQA), by leveraging Florence-2's prompt-based approach and sequence-to-sequence architecture.
How It Works
Florence-2 is a powerful VLM trained on the extensive FLD-5B dataset, allowing it to handle diverse vision-language tasks through simple text prompts. This node integrates Florence-2 into the ComfyUI workflow, facilitating tasks like DocVQA by allowing users to ask questions about document images and receive answers derived from the visual and textual content.
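To illustrate the prompt-based interface the nodes build on, here is a minimal sketch of running Florence-2 directly through Hugging Face transformers, independent of ComfyUI. It follows the usage pattern from the microsoft/Florence-2-base model card; the image URL and generation settings are illustrative choices, not taken from this repository.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Florence-2 ships custom modeling code, hence trust_remote_code=True.
model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    stream=True).raw)

# Tasks are selected purely by text prompt, e.g. <CAPTION>, <OD>, <OCR>.
task = "<CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation parses task-specific output
# (plain text for captions, boxes/labels for detection tasks).
parsed = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height))
print(parsed)
```

Swapping the task token (for example <OD> for object detection or <OCR> for text extraction) is all it takes to switch tasks; the ComfyUI nodes expose this choice as a node parameter.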
Quick Start & Requirements
- Clone the repository into the ComfyUI/custom_nodes directory.
- Install dependencies with pip install -r requirements.txt (requires transformers>=4.38.0).
- Florence-2 model files are placed under ComfyUI/models/LLM.
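For the DocVQA use case highlighted above, a hedged sketch of what an equivalent direct call might look like follows. The checkpoint name (HuggingFaceM4/Florence-2-DocVQA), the <DocVQA> task token, the file name invoice.png, and the question are all assumptions for illustration; the node in this repository wraps comparable logic and handles model loading from ComfyUI/models/LLM internally.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumption: a DocVQA fine-tune such as HuggingFaceM4/Florence-2-DocVQA;
# the base microsoft/Florence-2 checkpoints do not ship a <DocVQA> task token.
model_id = "HuggingFaceM4/Florence-2-DocVQA"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png")  # any document image on disk (hypothetical file)

# The natural-language question is appended directly after the task token.
prompt = "<DocVQA>What is the invoice total?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=128,
)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(answer)
```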
Highlighted Details
Maintenance & Community
No specific community links or maintenance details are provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.
Limitations & Caveats
Accuracy for DocVQA is dependent on input image quality and question complexity. The README does not mention specific hardware requirements beyond standard ComfyUI dependencies.