t2v_metrics by linzhiqiu

Evaluation metric for text-to-image/video/3D models

created 1 year ago
319 stars

Top 86.2% on sourcepulse

View on GitHub
Project Summary

VQAScore provides an automated method for evaluating text-to-image, text-to-video, and text-to-3D generation models. It is designed for researchers and practitioners in generative AI, offering a more robust alternative to existing metrics like CLIPScore, particularly for compositional prompts.

How It Works

VQAScore uses image-to-text generation models (such as CLIP-FlanT5, LLaVA, InstructBLIP, and GPT-4o) to assess how well generated visual content aligns with a textual description. The evaluation is framed as a question-answering task: the model is asked whether the visual content shows the text prompt, and the score reflects the likelihood of it answering "Yes". This captures nuanced relationships and compositional structure better than simple embedding-similarity scores.
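
A minimal usage sketch, following the scoring interface shown in the repository README (the model name and image paths are illustrative; check the current README for exact identifiers):

```python
import t2v_metrics

# Load a VQAScore model; 'clip-flant5-xxl' is the scoring model recommended in the README.
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Score a single (image, text) pair: the model is asked whether the image shows the text,
# and the score reflects the likelihood that it answers "Yes".
score = clip_flant5_score(
    images=['images/0.png'],
    texts=['someone talks on the phone angrily while another person sits happily'],
)

# Passing M images and N texts returns an M x N tensor of alignment scores.
scores = clip_flant5_score(
    images=['images/0.png', 'images/1.png'],
    texts=['a photo of a dog chasing a ball', 'a photo of a ball chasing a dog'],
)
```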

Quick Start & Requirements

  • Install: pip install -e . (after cloning the repo) or pip install t2v-metrics; a batched-scoring sketch follows this list.
  • Prerequisites: Python 3.10, PyTorch, and OpenAI CLIP installed from git+https://github.com/openai/CLIP.git.
  • GPU: a GPU with roughly 40GB of memory is recommended for the larger models (e.g., clip-flant5-xxl, llava-v1.5-13b); smaller models are available for limited resources.
  • Docs: Project Page, VQAScore Page, VQAScore Demo, GenAI-Bench Page, GenAI-Bench Demo.
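
For evaluating many generations at once, the README also shows a batched interface; a sketch of that usage, assuming the batch_forward signature from the README (file paths and prompts below are placeholders):

```python
import t2v_metrics

clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Each entry pairs one or more generated images with one or more candidate prompts.
dataset = [
    {'images': ['images/cat_0.png'], 'texts': ['a cat sitting on a red chair']},
    {'images': ['images/dog_0.png'], 'texts': ['a dog jumping over a wooden fence']},
]

# Returns a tensor of scores, one per (image, text) pair within each entry.
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16)
```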

Highlighted Details

  • VQAScore significantly outperforms CLIPScore and PickScore on compositional text prompts.
  • Supports multiple evaluation models including CLIP-FlanT5, LLaVA-1.5, InstructBLIP, and GPT-4o.
  • Includes implementations of other metrics, such as CLIPScore, BLIPv2Score, PickScore, HPSv2Score, and ImageReward, for comparison (see the sketch after this list).
  • Provides tools to reproduce results from the VQAScore and GenAI-Bench papers.
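
Per the README, the baseline metrics share the same calling convention as VQAScore; a sketch of instantiating them, assuming the model identifiers listed in the README (these may change between versions):

```python
import t2v_metrics

# CLIP-style embedding-similarity metrics
clip_score = t2v_metrics.CLIPScore(model='openai:ViT-L-14-336')
pick_score = t2v_metrics.CLIPScore(model='pickscore-v1')
hpsv2_score = t2v_metrics.CLIPScore(model='hpsv2')

# Image-text matching metrics
blip2_itm_score = t2v_metrics.ITMScore(model='blip2-itm')
image_reward_score = t2v_metrics.ITMScore(model='image-reward-v1')

# All metrics are called the same way as VQAScore.
score = clip_score(images=['images/0.png'], texts=['a red cube on top of a blue sphere'])
```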

Maintenance & Community

  • The VQAScore paper was published at ECCV 2024, and GenAI-Bench received a Best Short Paper award at a CVPR 2024 workshop.
  • Highlighted in Google's Imagen 3 report as a strong replacement for CLIPScore.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying models used (e.g., CLIP, FlanT5, LLaVA) have their own licenses, which may impose restrictions on commercial use or redistribution.

Limitations & Caveats

  • The README recommends keeping the default question and answer templates for reproducibility, although they can be overridden (a sketch follows this list).
  • A specific version of the transformers library (4.36.1) may be required for text-generation tasks.
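
The templates can be overridden via keyword arguments if needed; a hedged sketch of that option, assuming the question_template/answer_template arguments shown in the README (the template strings here are illustrative):

```python
import t2v_metrics

clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Overriding the defaults changes the scores and breaks comparability with published results.
score = clip_flant5_score(
    images=['images/0.png'],
    texts=['a red cube on top of a blue sphere'],
    question_template='Does this figure show "{}"? Please answer yes or no.',
    answer_template='Yes',
)
```
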
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 31 stars in the last 90 days
