Evaluation metric for text-to-image/video/3D models
VQAScore provides an automated method for evaluating text-to-image, text-to-video, and text-to-3D generation models. It is designed for researchers and practitioners in generative AI, offering a more robust alternative to existing metrics like CLIPScore, particularly for compositional prompts.
How It Works
VQAScore leverages image-to-text generation models (such as CLIP-FlanT5, LLaVA, InstructBLIP, and GPT-4o) to assess the alignment between generated visual content and textual descriptions. The core idea is to frame evaluation as a question-answering task: the model is asked whether the visual content matches the text prompt, and the score is its probability of answering yes. This captures nuanced relationships and compositional elements better than simple similarity scores.
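For intuition, here is a minimal sketch of that formulation. The yes/no question template follows the VQAScore paper; `answer_probability` is a hypothetical stand-in for however a given model exposes answer-token probabilities, not an actual library method:

```python
def vqa_score(vqa_model, image, text: str) -> float:
    """Return P("Yes") for a yes/no question about image-text alignment."""
    # Frame evaluation as question answering over the generated image.
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    # The score is the model's probability of answering "Yes";
    # `answer_probability` is hypothetical and model-dependent.
    return vqa_model.answer_probability(image, question, answer="Yes")
```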
Quick Start & Requirements
Install with pip install -e . (after cloning the repo) or pip install t2v-metrics. The CLIP dependency is installed from git+https://github.com/openai/CLIP.git. Recommended models include clip-flant5-xxl and llava-v1.5-13b; smaller models are available for limited resources.
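A minimal usage sketch, following the repository's documented API; the image path and prompt are placeholders, and a GPU with enough memory for the xxl checkpoint is assumed:

```python
import t2v_metrics

# Load the recommended CLIP-FlanT5 scorer; a smaller checkpoint such as
# 'clip-flant5-xl' can be substituted if GPU memory is limited.
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Returns an (n_images x n_texts) tensor of alignment scores.
scores = clip_flant5_score(
    images=['images/0.png'],  # placeholder path to a generated image
    texts=['two dogs chasing a red ball'],  # placeholder prompt
)
print(scores)
```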
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
A specific version of the transformers library (4.36.1) might be required for text generation tasks; if so, pin it with pip install transformers==4.36.1.