t2v_metrics by linzhiqiu

Evaluation metric for text-to-image/video/3D models

created 1 year ago
319 stars

Top 86.2% on sourcepulse

View on GitHub
Project Summary

VQAScore provides an automated method for evaluating text-to-image, text-to-video, and text-to-3D generation models. It is designed for researchers and practitioners in generative AI, offering a more robust alternative to existing metrics like CLIPScore, particularly for compositional prompts.

How It Works

VQAScore uses image-to-text generation models (such as CLIP-FlanT5, LLaVA, InstructBLIP, and GPT-4o) to assess how well generated visual content aligns with a textual description. The evaluation is framed as a question-answering task: the model is asked whether the visual content shows the text prompt, and the score reflects the likelihood of it answering "Yes". This captures nuanced relationships and compositional structure better than simple embedding-similarity scores.
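
A minimal usage sketch, following the scoring interface shown in the repository README (the model name and image paths are illustrative; check the current README for exact identifiers):

```python
import t2v_metrics

# Load a VQAScore model; 'clip-flant5-xxl' is the scoring model recommended in the README.
clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Score a single (image, text) pair: the model is asked whether the image shows the text,
# and the score reflects the likelihood that it answers "Yes".
score = clip_flant5_score(
    images=['images/0.png'],
    texts=['someone talks on the phone angrily while another person sits happily'],
)

# Passing M images and N texts returns an M x N tensor of alignment scores.
scores = clip_flant5_score(
    images=['images/0.png', 'images/1.png'],
    texts=['a photo of a dog chasing a ball', 'a photo of a ball chasing a dog'],
)
```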

Quick Start & Requirements

  • Install: pip install -e . (after cloning the repo) or pip install t2v-metrics; a batched-scoring sketch follows this list.
  • Prerequisites: Python 3.10, PyTorch, and OpenAI CLIP installed from git+https://github.com/openai/CLIP.git.
  • GPU: a GPU with roughly 40GB of memory is recommended for the larger models (e.g., clip-flant5-xxl, llava-v1.5-13b); smaller models are available for limited resources.
  • Docs: Project Page, VQAScore Page, VQAScore Demo, GenAI-Bench Page, GenAI-Bench Demo.
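
For evaluating many generations at once, the README also shows a batched interface; a sketch of that usage, assuming the batch_forward signature from the README (file paths and prompts below are placeholders):

```python
import t2v_metrics

clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Each entry pairs one or more generated images with one or more candidate prompts.
dataset = [
    {'images': ['images/cat_0.png'], 'texts': ['a cat sitting on a red chair']},
    {'images': ['images/dog_0.png'], 'texts': ['a dog jumping over a wooden fence']},
]

# Returns a tensor of scores, one per (image, text) pair within each entry.
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16)
```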

Highlighted Details

  • VQAScore significantly outperforms CLIPScore and PickScore on compositional text prompts.
  • Supports multiple evaluation models including CLIP-FlanT5, LLaVA-1.5, InstructBLIP, and GPT-4o.
  • Includes implementations of other metrics, such as CLIPScore, BLIPv2Score, PickScore, HPSv2Score, and ImageReward, for comparison (see the sketch after this list).
  • Provides tools to reproduce results from the VQAScore and GenAI-Bench papers.
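
Per the README, the baseline metrics share the same calling convention as VQAScore; a sketch of instantiating them, assuming the model identifiers listed in the README (these may change between versions):

```python
import t2v_metrics

# CLIP-style embedding-similarity metrics
clip_score = t2v_metrics.CLIPScore(model='openai:ViT-L-14-336')
pick_score = t2v_metrics.CLIPScore(model='pickscore-v1')
hpsv2_score = t2v_metrics.CLIPScore(model='hpsv2')

# Image-text matching metrics
blip2_itm_score = t2v_metrics.ITMScore(model='blip2-itm')
image_reward_score = t2v_metrics.ITMScore(model='image-reward-v1')

# All metrics are called the same way as VQAScore.
score = clip_score(images=['images/0.png'], texts=['a red cube on top of a blue sphere'])
```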

Maintenance & Community

  • The VQAScore paper was published at ECCV 2024, and GenAI-Bench received a Best Short Paper award at a CVPR 2024 workshop.
  • Highlighted in Google's Imagen 3 report as a strong replacement for CLIPScore.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README. The underlying models used (e.g., CLIP, FlanT5, LLaVA) have their own licenses, which may impose restrictions on commercial use or redistribution.

Limitations & Caveats

  • The README recommends keeping the default question and answer templates for reproducibility, although they can be overridden (a sketch follows this list).
  • A specific version of the transformers library (4.36.1) may be required for text-generation tasks.
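
The templates can be overridden via keyword arguments if needed; a hedged sketch of that option, assuming the question_template/answer_template arguments shown in the README (the template strings here are illustrative):

```python
import t2v_metrics

clip_flant5_score = t2v_metrics.VQAScore(model='clip-flant5-xxl')

# Overriding the defaults changes the scores and breaks comparability with published results.
score = clip_flant5_score(
    images=['images/0.png'],
    texts=['a red cube on top of a blue sphere'],
    question_template='Does this figure show "{}"? Please answer yes or no.',
    answer_template='Yes',
)
```
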
Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 31 stars in the last 90 days
