cambrian-mllm: Multimodal LLM for advanced video spatial understanding
Top 72.6% on SourcePulse
Cambrian-S addresses the challenge of spatial supersensing in video, offering a suite of multimodal large language models (MLLMs) optimized for spatial reasoning. Aimed at AI researchers and practitioners working on video analysis, it delivers significant improvements on spatial understanding benchmarks while maintaining comparable general video comprehension, and is supported by new datasets and evaluation benchmarks.
How It Works
The Cambrian-S models pair Qwen2.5 base LLMs with SigLIP2 vision encoders and are available in sizes ranging from 0.5B to 7B parameters. Trained with a "Predictive Sensing" methodology, they are engineered for superior comprehension of spatial relationships within videos, setting them apart from standard MLLMs on tasks that demand precise spatial awareness.
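For intuition, the sketch below shows the generic vision-encoder, projector, and language-model composition that MLLMs of this kind follow. Every module choice, dimension, and the simple token-concatenation strategy here is an illustrative assumption, not the project's actual implementation.

```python
# Illustrative sketch of the common MLLM composition (vision encoder -> projector -> LLM).
# All modules and dimensions below are toy assumptions for illustration only;
# they are not the Cambrian-S implementation.
import torch
import torch.nn as nn

class ToyMultimodalLM(nn.Module):
    def __init__(self, vision_dim=64, llm_dim=128):
        super().__init__()
        # Stand-in for a SigLIP2-style vision encoder producing per-frame features.
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        # Projector that maps visual features into the language model's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stand-in for the language-model backbone (Qwen2.5 in Cambrian-S).
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, frame_features, text_embeddings):
        # Project frame features into the text embedding space, then feed the
        # combined visual + text token sequence through the backbone.
        visual_tokens = self.projector(self.vision_encoder(frame_features))
        tokens = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.llm(tokens)

model = ToyMultimodalLM()
frames = torch.randn(1, 16, 64)   # 16 frames of pooled visual features
text = torch.randn(1, 32, 128)    # 32 text-token embeddings
print(model(frames, text).shape)  # torch.Size([1, 48, 128])
```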
Quick Start & Requirements
Model weights are accessible via Hugging Face repositories (e.g., nyu-visionx/Cambrian-S-7B-LFP). The evaluation suite is released; however, the TPU-based training code is still undergoing cleaning and reorganization. A new dataset, VSI-590K, and a benchmark, VSI-SUPER, are also provided to facilitate research in spatial video understanding. Specific hardware or software prerequisites beyond a standard ML environment are not detailed.
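A minimal sketch of fetching the released weights, assuming only that the repository id quoted above is valid and that the huggingface_hub package is installed; loading and running inference depend on the project's own codebase and are not shown.

```python
# Minimal sketch: download the Cambrian-S-7B-LFP weights from Hugging Face.
# Assumes huggingface_hub is installed; inference requires the project's own code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nyu-visionx/Cambrian-S-7B-LFP")
print(f"Weights downloaded to: {local_dir}")
```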
Maintenance & Community
The project is associated with a strong research team, including prominent figures like Yann LeCun, Li Fei-Fei, and Rob Fergus. Several related projects and publications are listed, indicating active development and a robust research foundation. However, direct links to community channels (e.g., Discord, Slack) or a public roadmap are not present in the provided README.
Licensing & Compatibility
The README does not explicitly state the license for the model weights, training code, or dataset. Given the arXiv preprint citations, the release is likely intended for research use, and compatibility with commercial use requires clarification from the maintainers.
Limitations & Caveats
The training code is not yet fully released or stabilized; users must wait for further updates. The project's release date is noted as November 6, 2025. The lack of specified licensing terms is a significant adoption blocker, making it difficult to assess suitability for commercial use.
Last updated: 2 weeks ago · Status: Inactive