unicom by deepglint

Visual representation model for multimodal LLMs

created 2 years ago
697 stars

Top 48.9% on SourcePulse

Project Summary

This repository provides foundational visual representation models, UNICOM and MLCD, designed as vision encoders for large multimodal models (LMMs). It targets researchers and developers building vision-language systems, offering improved visual understanding and stronger performance on downstream tasks.

How It Works

UNICOM clusters 400 million images into 1 million pseudo-classes using joint textual and visual features, then trains with a margin-based softmax loss (ArcFace) and partial feature selection (PartialFC) for robust image retrieval. MLCD extends contrastive learning by clustering large-scale data into millions of centers and assigning each image its multiple closest clusters as labels, which better encodes complex image semantics and improves performance on multimodal tasks.
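The multi-label assignment idea behind MLCD can be illustrated with a small sketch: rather than giving each image a single pseudo-class, the k closest cluster centers all serve as positive labels. This is a minimal pure-Python illustration; the function name and toy 2-D data are hypothetical (the real pipeline clusters hundreds of millions of embeddings into millions of centers on GPU).

```python
import math

def top_k_clusters(embedding, centers, k=2):
    """Return indices of the k nearest cluster centers.

    MLCD-style multi-label assignment: instead of one pseudo-class per
    image, the k closest cluster centers all act as positive labels.
    """
    def dist(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ranked = sorted(range(len(centers)), key=lambda i: dist(embedding, centers[i]))
    return ranked[:k]

# Toy example: four cluster centers, one image embedding.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = top_k_clusters((0.4, 0.1), centers, k=2)
print(labels)  # -> [0, 1]: the two nearest centers become positive pseudo-labels
```

In the actual training objective these multiple positives enter a margin-based softmax rather than a plain cross-entropy, but the assignment step is the part that distinguishes MLCD from single-label clustering.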

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview
  • Requires the transformers library (main branch or the v4.51.3-MLCD-preview tag above).
  • GPU recommended for efficient feature extraction.
  • Official model checkpoints are available on Hugging Face for direct use.

Highlighted Details

  • MLCD models (ViT-L-14@336px, ViT-bigG-14@336px, ViT-bigG-14@448px) show state-of-the-art performance on benchmarks like ChartQA, DocVQA, and MMMU.
  • MLCD-bigG-14-448px with RoPE2D achieves top scores across multiple vision-language benchmarks.
  • MLCD models are integrated into the Hugging Face transformers library.
  • MLCD-Embodied-7B demonstrates competitive performance against LLaVA OneVision-7B and GPT-4V on various multimodal tasks.

Maintenance & Community

  • Active development with recent releases and integrations (e.g., Hugging Face transformers).
  • Key contributors include Bin Qin, Lan Wu, Haiqiang Jiang, and Yuling Wu.
  • Research papers published at ICLR 2023 (UNICOM) and ECCV 2024 (MLCD).

Licensing & Compatibility

  • The repository itself does not explicitly state a license.
  • Model checkpoints are distributed via Hugging Face, typically under permissive licenses, but each checkpoint's license should be verified on its model card (the transformers library itself is Apache 2.0).

Limitations & Caveats

  • The transformers installation requires a specific, potentially unstable, preview branch.
  • While benchmarks are provided, direct reproduction steps for all models might require referring to linked papers or specific sub-directories.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days
