Visual representation models for multimodal LLMs
This repository provides foundational visual representation models, specifically UNICOM and MLCD, designed for large multimodal language models (LMMs). It targets researchers and developers building advanced vision-language systems, offering improved visual understanding and performance on various downstream tasks.
How It Works
UNICOM employs a joint textual and visual feature clustering approach, grouping 400 million images into 1 million pseudo-classes using a margin-based softmax loss (ArcFace) and partial feature selection (PartialFC) for robust image retrieval. MLCD enhances traditional contrastive learning by clustering large datasets into millions of centers and assigning multiple closest clusters as labels, addressing limitations in encoding complex image semantics and improving performance on multimodal tasks.
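As a rough illustration of the multi-label assignment step described above, the sketch below (not taken from the repository; the function name, array sizes, and `k` are made up for illustration) shows how each image embedding could receive its k closest cluster centers as positive pseudo-labels:

```python
import numpy as np

def assign_multi_labels(embeddings, centers, k=3):
    """Toy illustration: give each embedding its k most similar cluster
    centers as positive pseudo-labels (multi-label cluster discrimination)."""
    # Normalize so cosine similarity reduces to a dot product.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sims = embeddings @ centers.T                  # (num_images, num_clusters)
    # Indices of the k most similar centers per image.
    topk = np.argsort(-sims, axis=1)[:, :k]
    labels = np.zeros_like(sims, dtype=bool)
    np.put_along_axis(labels, topk, True, axis=1)
    return labels                                  # multi-hot pseudo-labels

# Example: 8 image embeddings, 16 cluster centers, 3 positive labels each.
rng = np.random.default_rng(0)
labels = assign_multi_labels(rng.normal(size=(8, 64)), rng.normal(size=(16, 64)))
print(labels.sum(axis=1))  # -> [3 3 3 3 3 3 3 3]
```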
Quick Start & Requirements
```bash
pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview
```
Requires the `transformers` library (master branch or the specific preview version above).
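A minimal usage sketch follows, assuming the preview branch exposes an `MLCDVisionModel` class and that the checkpoint id used here is published on Hugging Face; both the class name and the checkpoint id are assumptions, so check the repository and its model cards for the exact identifiers:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, MLCDVisionModel  # MLCDVisionModel assumed from the preview branch

# Hypothetical checkpoint id; verify the exact name on Hugging Face.
model_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"

processor = AutoImageProcessor.from_pretrained(model_id)
model = MLCDVisionModel.from_pretrained(model_id)

image = Image.open("example.jpg")                     # any local RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level visual features for downstream LMM use.
print(outputs.last_hidden_state.shape)
```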
Highlighted Details
Models can be loaded directly through the Hugging Face `transformers` library.

Maintenance & Community
MLCD support is being upstreamed into Hugging Face (currently via a preview branch of `transformers`).

Licensing & Compatibility
Integration is provided through the Hugging Face `transformers` library, but specific model licenses should be verified on Hugging Face.

Limitations & Caveats
The `transformers` installation requires a specific, potentially unstable, preview branch.