Cambrian-1 is a family of open-source, vision-centric multimodal large language models (MLLMs) designed for researchers and developers. It delivers performance competitive with proprietary models such as GPT-4V and Gemini Pro, with a focus on efficient vision integration and a novel data engine for curating training data.
How It Works
Cambrian-1 uses a vision-centric design built around a Spatial Vision Aggregator (SVA) module that connects multiple frozen vision encoders to the LLM. The SVA condenses visual features into a small, fixed number of visual tokens regardless of input resolution, improving both efficiency and performance. Training proceeds in two stages: first the SVA connector is trained, then the model is instruction-tuned on the large-scale Cambrian-7M dataset.
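To make the connector idea concrete, here is a minimal, hypothetical PyTorch sketch of an SVA-style module: a fixed grid of learnable queries cross-attends to projected features from several vision encoders, so the LLM always receives the same number of visual tokens. All names and dimensions are illustrative, and the sketch omits parts of the actual design (such as restricting each query to a local spatial window and re-aggregating at multiple LLM layers).

```python
import torch
import torch.nn as nn


class SimplifiedSVA(nn.Module):
    """Illustrative Spatial Vision Aggregator-style connector (not the
    actual Cambrian-1 implementation).

    A fixed grid of learnable queries cross-attends to projected features
    from several frozen vision encoders, yielding a constant number of
    visual tokens regardless of encoder resolution.
    """

    def __init__(self, encoder_dims, hidden_dim=4096, grid_size=24, num_heads=8):
        super().__init__()
        # One learnable query per cell of a grid_size x grid_size spatial grid.
        self.queries = nn.Parameter(torch.randn(grid_size**2, hidden_dim) * 0.02)
        # Per-encoder projections into the shared hidden dimension.
        self.projections = nn.ModuleList(nn.Linear(d, hidden_dim) for d in encoder_dims)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, encoder_features):
        # encoder_features: list of (batch, num_patches_i, dim_i) tensors,
        # one per frozen vision encoder.
        batch = encoder_features[0].shape[0]
        keys = torch.cat(
            [proj(feats) for proj, feats in zip(self.projections, encoder_features)],
            dim=1,
        )
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Returns (batch, grid_size**2, hidden_dim) visual tokens for the LLM.
        tokens, _ = self.cross_attn(queries, keys, keys)
        return tokens
```

For instance, `SimplifiedSVA([1024, 1152])` would aggregate a 1024-dim and a 1152-dim encoder's features into a single fixed-length token sequence; the fixed token count is what keeps the LLM's visual context length constant as encoders or resolutions change.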
Quick Start & Requirements
- TPU Training:
  ```bash
  pip install -e ".[tpu]"
  pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html
  ```
- GPU Inference:
  ```bash
  pip install ".[gpu]"
  ```
- Dependencies: Python 3.10, PyTorch, TorchXLA (for TPU).
- Models: 8B, 13B, and 34B checkpoints are available on Hugging Face (see the download sketch after this list).
- Demo: Gradio web UI and CLI inference are supported. See Demo Architecture for setup.
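As a minimal sketch of fetching pretrained weights with `huggingface_hub`, assuming the repo id `nyu-visionx/cambrian-8b` (verify the exact names on the model hub):

```python
from huggingface_hub import snapshot_download

# Repo id assumed from the project's Hugging Face organization;
# check the model hub for the exact 8B/13B/34B repo names.
local_dir = snapshot_download(repo_id="nyu-visionx/cambrian-8b")
print(f"Model files downloaded to {local_dir}")
```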
Highlighted Details
- Offers 8B, 13B, and 34B parameter models that perform competitively with leading proprietary MLLMs.
- Features a novel "Internet Data Engine" for collecting science-related visual instruction tuning data, increasing domain data by 400%.
- Introduces Cambrian-7M, a curated instruction tuning dataset, and mitigates the "Answer Machine" phenomenon with response-formatting system prompts (see the prompt sketch after this list).
- Supports training on TPUs (v4-512 minimum) and provides scripts for both TPU and upcoming GPU training.
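As an illustrative sketch of the system-prompt idea (the wording and template here are hypothetical, not the exact strings used in Cambrian-7M): attaching an explicit response-format instruction to benchmark-style questions lets the model switch between terse "answer machine" replies and normal conversational ones.

```python
# Hypothetical prompt assembly; the actual system prompts used when
# curating Cambrian-7M may differ.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


def build_prompt(question: str, short_answer: bool = False) -> str:
    """Append a response-formatting system prompt to a VQA-style question."""
    if short_answer:
        return f"{question}\n{SHORT_ANSWER_PROMPT}"
    return question


print(build_prompt("What color is the car?", short_answer=True))
```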
Maintenance & Community
- CV-Bench, the project's MLLM evaluation suite, was released on Hugging Face on 09/09/24.
- The project is associated with researchers from NYU, Meta AI, and Columbia University.
- The codebase is heavily inspired by LLaVA.
Licensing & Compatibility
- The project itself imposes no additional constraints beyond the original licenses of the datasets and base language models it uses (e.g., the Llama community license for Llama 3, the Vicuna-1.5 license); users must comply with all applicable terms.
Limitations & Caveats
- GPU training scripts were still listed as coming "very soon" at the time of the README release.
- SGLang worker setup for the 34B model is "coming soon."
- Users must ensure compliance with all underlying dataset and model licenses.