ComfyUI-Florence2  by kijai

ComfyUI nodes for Florence2 vision-language model inference

Created 1 year ago
1,695 stars

Top 24.5% on SourcePulse

Project Summary

This repository provides a ComfyUI custom node for running Microsoft's Florence-2 vision-language model (VLM). It covers vision tasks such as object detection, segmentation, captioning, and Document Visual Question Answering (DocVQA), all driven by Florence-2's prompt-based, sequence-to-sequence design.

How It Works

Florence-2 is a VLM trained on the FLD-5B dataset; a short text prompt selects which vision-language task it performs. This node exposes the model inside ComfyUI workflows. For DocVQA, the user supplies a document image and a question, and the model generates an answer from the document's visual and textual content.
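The prompt-driven design can be illustrated with a minimal Python sketch. It assumes the task tokens shown on the Florence-2 model card (`<CAPTION>`, `<OD>`, `<OCR>`) plus a `<DocVQA>` token for document question answering, which is an assumption about the DocVQA fine-tune rather than something this page states. The names `TASK_PROMPTS` and `build_prompt` are hypothetical and are not part of the node.

```python
# Hypothetical mapping from a friendly task name to a Florence-2 task token.
# <CAPTION>, <OD>, and <OCR> appear on the Florence-2 model card; <DocVQA>
# is assumed here for a DocVQA fine-tune and may differ per checkpoint.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "docvqa": "<DocVQA>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the text prompt for a task, appending free text when given."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    question = text_input.strip()
    # A DocVQA prompt is the task token followed by the user's question.
    if task == "docvqa" and not question:
        raise ValueError("docvqa needs a question")
    return TASK_PROMPTS[task] + question


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("docvqa", "What is the invoice total?"))
```

The resulting string would be passed to the model together with the image, so switching tasks amounts to changing the prompt rather than the model.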

Quick Start & Requirements

Highlighted Details

  • Adds Document Visual Question Answering (DocVQA) capability to ComfyUI.
  • Supports a range of Florence-2 models and tested fine-tunes.
  • Enables prompt-based vision tasks like captioning, object detection, and segmentation.
  • Integrates seamlessly as a ComfyUI custom node.
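
For a task such as object detection, the generated text is parsed into structured results. The sketch below assumes the parsed form shown on the Florence-2 model card: a dictionary keyed by the task token, holding parallel `bboxes` and `labels` lists. The `Detection` type, the `pair_detections` helper, and the sample values are hypothetical and only illustrate how such output could be consumed downstream.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    label: str
    box: Tuple[float, ...]  # (x1, y1, x2, y2) in pixels


def pair_detections(parsed: Dict[str, dict], task: str = "<OD>") -> List[Detection]:
    """Zip the parallel label and box lists into one record per object."""
    result = parsed.get(task, {})
    bboxes = result.get("bboxes", [])
    labels = result.get("labels", [])
    if len(bboxes) != len(labels):
        raise ValueError("bboxes and labels must have the same length")
    return [Detection(label, tuple(box)) for label, box in zip(labels, bboxes)]


# Made-up sample in the assumed output shape; not real model output.
sample = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["cat"]}}

if __name__ == "__main__":
    for det in pair_detections(sample):
        print(det.label, det.box)
```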

Maintenance & Community

No specific community links or maintenance details are provided in the README.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not mentioned.

Limitations & Caveats

DocVQA accuracy depends on input image quality and question complexity. The README lists no hardware requirements beyond the standard ComfyUI dependencies.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

30 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

0.3%
356
Vision-language research paper using LLMs
Created 2 years ago
Updated 10 months ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Simon Willison (Coauthor of Django), and 10 more.

LAVIS by salesforce

0.0%
11k
Library for language-vision AI research
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.2%
5k
MoE vision-language model for multimodal understanding
Created 1 year ago
Updated 1 year ago