Visual representation models for multimodal LLMs
This repository provides foundational visual representation models, specifically UNICOM and MLCD, designed for large multimodal language models (LMMs). It targets researchers and developers building advanced vision-language systems, offering improved visual understanding and performance on various downstream tasks.
How It Works
UNICOM employs a joint textual and visual feature clustering approach, grouping 400 million images into 1 million pseudo-classes using a margin-based softmax loss (ArcFace) and partial feature selection (PartialFC) for robust image retrieval. MLCD enhances traditional contrastive learning by clustering large datasets into millions of centers and assigning multiple closest clusters as labels, addressing limitations in encoding complex image semantics and improving performance on multimodal tasks.
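As a rough illustration of the multi-label assignment step described above, the sketch below (not taken from the repository; the function name, array sizes, and `k` are made up for illustration) shows how each image embedding could receive its k closest cluster centers as positive pseudo-labels:

```python
import numpy as np

def assign_multi_labels(embeddings, centers, k=3):
    """Toy illustration: give each embedding its k most similar cluster
    centers as positive pseudo-labels (multi-label cluster discrimination)."""
    # Normalize so cosine similarity reduces to a dot product.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = embeddings @ centers.T                  # (num_images, num_clusters)
    # Indices of the k most similar centers per image.
    topk = np.argsort(-sims, axis=1)[:, :k]
    labels = np.zeros_like(sims, dtype=bool)
    np.put_along_axis(labels, topk, True, axis=1)
    return labels                                  # multi-hot pseudo-labels

# Example: 8 image embeddings, 16 cluster centers, 3 positive labels each.
rng = np.random.default_rng(0)
labels = assign_multi_labels(rng.normal(size=(8, 64)), rng.normal(size=(16, 64)))
print(labels.sum(axis=1))  # -> [3 3 3 3 3 3 3 3]
```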
Quick Start & Requirements
```bash
pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview
```
Requires the `transformers` library (master branch or the specific preview version above).
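A minimal usage sketch follows, assuming the preview branch exposes an `MLCDVisionModel` class and that the checkpoint id used here is published on Hugging Face; both the class name and the checkpoint id are assumptions, so check the repository and its model cards for the exact identifiers:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, MLCDVisionModel  # MLCDVisionModel assumed from the preview branch

# Hypothetical checkpoint id; verify the exact name on Hugging Face.
model_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"

processor = AutoImageProcessor.from_pretrained(model_id)
model = MLCDVisionModel.from_pretrained(model_id)

image = Image.open("example.jpg")                     # any local RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level visual features for downstream LMM use.
print(outputs.last_hidden_state.shape)
```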
Highlighted Details
Models can be loaded directly through the Hugging Face `transformers` library.

Maintenance & Community
MLCD support is being upstreamed into Hugging Face (currently via a preview branch of `transformers`).

Licensing & Compatibility
Integration is provided through the Hugging Face `transformers` library, but specific model licenses should be verified on Hugging Face.

Limitations & Caveats
The `transformers` installation requires a specific, potentially unstable, preview branch.