Visual scoring foundation model for image/video quality and aesthetics assessment
Top 65.6% on sourcepulse
Q-Align is an all-in-one foundation model for visual scoring tasks, including Image Quality Assessment (IQA), Image Aesthetic Assessment (IAA), and Video Quality Assessment (VQA). It is designed for researchers and developers working with multimodal large language models (LMMs) who need a unified approach to evaluating visual content. The model offers efficient fine-tuning capabilities for downstream datasets and aims to simplify the process of visual scoring.
How It Works
Q-Align leverages a LLaVA-style architecture, integrating visual understanding with language models. It employs discrete, text-defined levels to teach LMMs how to perform visual scoring. This approach allows for a unified model that can handle diverse scoring tasks by mapping visual inputs to predefined textual categories or scores, enabling efficient fine-tuning and adaptation to specific datasets.
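The level-to-score mapping described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the five level words follow the Q-Align paper, while the logits are stand-ins for what the LMM would assign to each level token. The continuous score is a softmax-weighted average over the text-defined levels.

```python
import math

# Discrete, text-defined rating levels mapped to numeric scores (5 = best).
LEVELS = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def levels_to_score(level_logits):
    """Collapse per-level logits into one continuous quality score
    via a softmax-weighted average over the rating levels."""
    max_logit = max(level_logits.values())          # for numerical stability
    exp = {k: math.exp(v - max_logit) for k, v in level_logits.items()}
    total = sum(exp.values())
    probs = {k: v / total for k, v in exp.items()}  # softmax over levels
    return sum(probs[k] * LEVELS[k] for k in LEVELS)

# Illustrative logits: the model is fairly confident the image is "good".
score = levels_to_score({"excellent": 1.0, "good": 2.5, "fair": 0.5,
                         "poor": -1.0, "bad": -2.0})
print(round(score, 2))  # a value between 1 (bad) and 5 (excellent), here near 4
```

Because the output is a probability-weighted average rather than a hard class label, the same model head yields fine-grained scores suitable for IQA, IAA, and VQA benchmarks.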
Quick Start & Requirements
The model can be loaded via Hugging Face's AutoModel interface, or by installing the repository (pip install -e .). For training, additional dependencies are required (pip install -e ".[train]" and flash_attn).
Highlighted Details
Requires transformers==4.36.1.
Maintenance & Community
The project is associated with Nanyang Technological University and Shanghai Jiao Tong University. Contact information for authors is provided for queries.
Licensing & Compatibility
The repository appears to be released under a permissive license, but specific details are not explicitly stated in the README. Compatibility with commercial or closed-source projects should be verified.
Limitations & Caveats
The README notes that the v1.1 update is incompatible with older versions (v1.0.1/v1.0.0 and before). Training from scratch requires substantial GPU resources. The specific license for commercial use is not clearly defined.