cambrian by cambrian-mllm

Multimodal LLM research paper with vision-centric design

created 1 year ago
1,932 stars

Top 23.1% on sourcepulse

Project Summary

Cambrian-1 is a family of open-source, vision-centric multimodal large language models (MLLMs) designed for researchers and developers. It offers competitive performance against proprietary models like GPT-4V and Gemini-Pro, with a focus on efficient vision integration and a novel data engine for curated training data.

How It Works

Cambrian-1 uses a vision-centric design built around a Spatial Vision Aggregator (SVA) module that bridges multiple frozen vision encoders and the LLM. Because the SVA aggregates encoder features into a small, fixed number of visual tokens, it keeps the token budget constant regardless of how many encoders feed in, improving both efficiency and performance. Training proceeds in two stages: first the SVA connector is trained, then the model is instruction-tuned on the large-scale Cambrian-7M dataset.
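The aggregation idea can be sketched as a cross-attention pooling step: a fixed set of learnable queries attends over the concatenated patch features of several encoders and always emits the same number of visual tokens. This is a minimal illustrative PyTorch toy, not the repo's implementation; it omits the SVA's spatial inductive bias (queries attending to local regions) and its multi-layer aggregation, and all class, parameter, and dimension names below are invented for the sketch.

```python
import torch
import torch.nn as nn

class SVASketch(nn.Module):
    """Toy stand-in for the SVA idea: learnable queries cross-attend to
    features from several frozen vision encoders, producing a constant
    visual-token count regardless of encoder count or patch resolution."""

    def __init__(self, d_model=1024, num_queries=64, num_heads=8):
        super().__init__()
        # Fixed set of learnable query vectors: one per output visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, encoder_feats):
        # encoder_feats: list of (batch, n_patches_i, d_model) tensors,
        # one per vision encoder; patch counts may differ per encoder.
        kv = torch.cat(encoder_feats, dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        return out  # (batch, num_queries, d_model): fixed token budget

# Two hypothetical encoders with different patch counts (e.g. 24x24 and 27x27).
feats = [torch.randn(2, 576, 1024), torch.randn(2, 729, 1024)]
tokens = SVASketch()(feats)
print(tokens.shape)  # torch.Size([2, 64, 1024])
```

The point of the sketch is the output shape: the LLM always sees 64 visual tokens here, no matter how many encoders or patches go in, which is what keeps the visual token count small and fixed.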

Quick Start & Requirements

  • TPU Training: pip install -e ".[tpu]", then pip install torch~=2.2.0 torch_xla[tpu]~=2.2.0 -f https://storage.googleapis.com/libtpu-releases/index.html
  • GPU Inference: pip install ".[gpu]"
  • Dependencies: Python 3.10, PyTorch, TorchXLA (for TPU).
  • Models: Available on Hugging Face (8B, 13B, 34B).
  • Demo: Gradio web UI and CLI inference are supported. See Demo Architecture for setup.

Highlighted Details

  • Offers 8B, 13B, and 34B parameter models with competitive performance against leading proprietary MLLMs.
  • Features a novel "Internet Data Engine" for collecting science-related visual instruction tuning data, increasing domain data by 400%.
  • Introduces Cambrian-7M, a curated instruction tuning dataset, and addresses the "Answer Machine" phenomenon with system prompts.
  • Supports training on TPUs (v4-512 minimum) and provides scripts for both TPU and upcoming GPU training.

Maintenance & Community

  • Released on 09/09/24 with an MLLM evaluation suite (CV-Bench) on Hugging Face.
  • The project is associated with researchers from NYU, Meta AI, and Columbia University.
  • Codebase is heavily inspired by LLaVA.

Licensing & Compatibility

  • The project itself does not impose additional constraints beyond the original licenses of the datasets and base language models used (e.g., Llama community license for LLaMA-3, Vicuna-1.5 license). Users must comply with all applicable terms.

Limitations & Caveats

  • GPU training scripts were still marked as coming "very soon" at the time of the README's release.
  • SGLang worker setup for the 34B model is "coming soon."
  • Users must ensure compliance with all underlying dataset and model licenses.
Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 39 stars in the last 90 days

Explore Similar Projects

Starred by Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), Nathan Lambert (AI Researcher at AI2), and 1 more.

unified-io-2 by allenai (0.3%, 619 stars)

Unified-IO 2 code for training, inference, and demo
Created 1 year ago; updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38 (0.3%, 9k stars)

Tiny pretraining project for a 1.1B Llama model
Created 1 year ago; updated 1 year ago