Cambrian-1 is a family of open-source, vision-centric multimodal large language models (MLLMs) designed for researchers and developers. It delivers performance competitive with proprietary models such as GPT-4V and Gemini Pro, with a focus on efficient vision integration and a novel data engine for curating training data.
How It Works
Cambrian-1 uses a vision-centric design built around a Spatial Vision Aggregator (SVA) module that connects multiple frozen vision encoders to the LLM. The SVA condenses visual features into a small, fixed number of visual tokens regardless of input resolution, improving both efficiency and performance. Training proceeds in two stages: first the SVA connector is trained, then the model is instruction-tuned on the large-scale Cambrian-7M dataset.
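To make the connector idea concrete, here is a minimal, hypothetical PyTorch sketch of an SVA-style module: a fixed grid of learnable queries cross-attends to projected features from several vision encoders, so the LLM always receives the same number of visual tokens. All names and dimensions are illustrative, and the sketch omits parts of the actual design (such as restricting each query to a local spatial window and re-aggregating at multiple LLM layers).

```python
import torch
import torch.nn as nn


class SimplifiedSVA(nn.Module):
    """Illustrative Spatial Vision Aggregator-style connector (not the
    actual Cambrian-1 implementation).

    A fixed grid of learnable queries cross-attends to projected features
    from several frozen vision encoders, yielding a constant number of
    visual tokens regardless of encoder resolution.
    """

    def __init__(self, encoder_dims, hidden_dim=4096, grid_size=24, num_heads=8):
        super().__init__()
        # One learnable query per cell of a grid_size x grid_size spatial grid.
        self.queries = nn.Parameter(torch.randn(grid_size**2, hidden_dim) * 0.02)
        # Per-encoder projections into the shared hidden dimension.
        self.projections = nn.ModuleList(nn.Linear(d, hidden_dim) for d in encoder_dims)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, encoder_features):
        # encoder_features: list of (batch, num_patches_i, dim_i) tensors,
        # one per frozen vision encoder.
        batch = encoder_features[0].shape[0]
        keys = torch.cat(
            [proj(feats) for proj, feats in zip(self.projections, encoder_features)],
            dim=1,
        )
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Returns (batch, grid_size**2, hidden_dim) visual tokens for the LLM.
        tokens, _ = self.cross_attn(queries, keys, keys)
        return tokens
```

For instance, `SimplifiedSVA([1024, 1152])` would aggregate a 1024-dim and a 1152-dim encoder's features into a single fixed-length token sequence; the fixed token count is what keeps the LLM's visual context length constant as encoders or resolutions change.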
Quick Start & Requirements
- TPU Training:
  ```bash
  pip install -e ".[tpu]"
  pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html
  ```
- GPU Inference:
  ```bash
  pip install ".[gpu]"
  ```
- Dependencies: Python 3.10, PyTorch, TorchXLA (for TPU).
- Models: 8B, 13B, and 34B checkpoints are available on Hugging Face (see the download sketch after this list).
- Demo: Gradio web UI and CLI inference are supported. See Demo Architecture for setup.
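As a minimal sketch of fetching pretrained weights with `huggingface_hub`, assuming the repo id `nyu-visionx/cambrian-8b` (verify the exact names on the model hub):

```python
from huggingface_hub import snapshot_download

# Repo id assumed from the project's Hugging Face organization;
# check the model hub for the exact 8B/13B/34B repo names.
local_dir = snapshot_download(repo_id="nyu-visionx/cambrian-8b")
print(f"Model files downloaded to {local_dir}")
```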
Highlighted Details
- Offers 8B, 13B, and 34B parameter models that perform competitively with leading proprietary MLLMs.
- Features a novel "Internet Data Engine" for collecting science-related visual instruction tuning data, increasing domain data by 400%.
- Introduces Cambrian-7M, a curated instruction tuning dataset, and mitigates the "Answer Machine" phenomenon with response-formatting system prompts (see the prompt sketch after this list).
- Supports training on TPUs (v4-512 minimum) and provides scripts for both TPU and upcoming GPU training.
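As an illustrative sketch of the system-prompt idea (the wording and template here are hypothetical, not the exact strings used in Cambrian-7M): attaching an explicit response-format instruction to benchmark-style questions lets the model switch between terse "answer machine" replies and normal conversational ones.

```python
# Hypothetical prompt assembly; the actual system prompts used when
# curating Cambrian-7M may differ.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


def build_prompt(question: str, short_answer: bool = False) -> str:
    """Append a response-formatting system prompt to a VQA-style question."""
    if short_answer:
        return f"{question}\n{SHORT_ANSWER_PROMPT}"
    return question


print(build_prompt("What color is the car?", short_answer=True))
```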
Maintenance & Community
- CV-Bench, the project's MLLM evaluation suite, was released on Hugging Face on 09/09/24.
- The project is associated with researchers from NYU, Meta AI, and Columbia University.
- The codebase is heavily inspired by LLaVA.
Licensing & Compatibility
- The project itself imposes no additional constraints beyond the original licenses of the datasets and base language models it uses (e.g., the Llama community license for Llama 3, the Vicuna-1.5 license); users must comply with all applicable terms.
Limitations & Caveats
- GPU training scripts were still listed as coming "very soon" at the time of the README release.
- SGLang worker setup for the 34B model is "coming soon."
- Users must ensure compliance with all underlying dataset and model licenses.