DepthLM_Official by facebookresearch

Vision Language Models for metric depth estimation

Created 5 months ago

320 stars

Top 85.0% on SourcePulse

Project Summary

DepthLM targets accurate metric depth estimation with standard Vision Language Models (VLMs), without architectural modifications. For researchers and engineers, this means a single VLM can handle diverse 3D understanding tasks, such as speed estimation and metric-scale camera pose estimation, that previously required specialized vision models or complex pipelines.

How It Works

DepthLM leverages standard text-based Supervised Fine-Tuning (SFT) on existing Vision Language Models. It demonstrates that VLMs can reach accuracy comparable to pure vision models for metric depth estimation without needing custom components like dense prediction heads or specific regression/regularization losses. This architectural simplicity is key to its versatility across various 3D perception tasks.
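
As a rough illustration of what text-based SFT for depth could look like, the sketch below serializes one metric depth label into a plain question/answer pair and parses a generated text answer back into a number. The prompt wording, the pixel-coordinate convention, the dictionary fields, the image path, and the helpers make_sft_sample and parse_depth_answer are all hypothetical; they are not taken from the DepthLM code, whose actual prompt and supervision format may differ.

```python
import re
from typing import Optional

# Hypothetical prompt template (an assumption, not DepthLM's actual format).
PROMPT = "How far is the point at pixel ({u}, {v}) from the camera, in meters?"


def make_sft_sample(image_path: str, u: int, v: int, depth_m: float) -> dict:
    """Turn one labeled pixel into a text-only supervision pair."""
    return {
        "image": image_path,
        "prompt": PROMPT.format(u=u, v=v),
        # The target is ordinary text, so a standard next-token loss applies;
        # no dense prediction head or regression loss is involved.
        "answer": f"{depth_m:.2f}",
    }


def parse_depth_answer(text: str) -> Optional[float]:
    """Extract the first decimal number from a generated answer, if any."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# "image.png" is a placeholder path used only for illustration.
sample = make_sft_sample("image.png", u=320, v=240, depth_m=2.347)
```

Keeping the label as text is what lets the same fine-tuning recipe extend to other quantities (for example speed or time) by changing only the question and the answer string.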

Quick Start & Requirements

Installation: Create a Conda environment (conda create -n DepthLM python=3.12) and install dependencies (pip install -r requirements.txt). The code is tested with transformers version 4.51.1.
Prerequisites: Python 3.12.
Data: Requires images and corresponding camera intrinsics/3D labels. Example data from the iBims1 dataset is available at examples/ibims1. Data curation code is provided for reproduction.
Links: Model download (🤗 Hugging Face), example data (examples/ibims1).
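
After installing, a small check like the following can confirm that the active environment matches the versions named above (Python 3.12, transformers 4.51.1). This is a minimal sketch, not part of the repository; the helper names installed_version and check_environment are made up for this example, and a different transformers version may still work even though only 4.51.1 is stated as tested.

```python
import sys
from importlib import metadata
from typing import List, Optional

EXPECTED_PYTHON = (3, 12)        # from the prerequisites above
TESTED_TRANSFORMERS = "4.51.1"   # version the code is stated to be tested with


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> List[str]:
    """Collect human-readable warnings; an empty list means everything matches."""
    warnings = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found_py = ".".join(str(n) for n in sys.version_info[:2])
        warnings.append(f"Python {found_py} found, 3.12 expected")
    found = installed_version("transformers")
    if found is None:
        warnings.append("transformers is not installed")
    elif found != TESTED_TRANSFORMERS:
        warnings.append(
            f"transformers {found} found, tested with {TESTED_TRANSFORMERS}"
        )
    return warnings


if __name__ == "__main__":
    for message in check_environment():
        print("warning:", message)
```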

Highlighted Details

Achieves accuracy comparable to pure vision models on metric depth estimation using standard VLMs.
Enables a unified VLM for complex 3D understanding tasks (e.g., speed/time estimation, metric-scale camera pose estimation).
Requires no architectural changes to the base VLM, such as dense prediction heads or specialized regression/regularization losses.

Maintenance & Community

Contact: Zhipeng Cai (Meta Inc), homepage: https://zhipengcai.github.io/, email: czptc2h at gmail dot com.

Licensing & Compatibility

License: FAIR CC-BY-NC.
Compatibility: The CC-BY-NC license prohibits commercial use. Redistribution and derivative works are permitted for non-commercial purposes, provided attribution is given.

Limitations & Caveats

The CC-BY-NC license rules out commercial applications. The curated datasets are not released directly for legal reasons, so users must prepare their own data (for example with the provided data curation code) or use the provided example data.

Health Check

Last Commit

1 week ago

Responsiveness

Inactive

Star History

15 stars in the last 30 days