DepthLM_Official by facebookresearch

Vision Language Models for metric depth estimation

Created 2 months ago
270 stars

Top 95.3% on SourcePulse

Project Summary

DepthLM shows that standard Vision Language Models (VLMs) can estimate metric depth accurately without any architectural modifications. This lets researchers and engineers use a single, unified VLM for diverse 3D understanding tasks, such as speed estimation and metric-scale camera pose estimation, that previously required specialized vision models or complex pipelines.

How It Works

DepthLM leverages standard text-based Supervised Fine-Tuning (SFT) on existing Vision Language Models. It demonstrates that VLMs can reach accuracy comparable to pure vision models for metric depth estimation without needing custom components like dense prediction heads or specific regression/regularization losses. This architectural simplicity is key to its versatility across various 3D perception tasks.
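To illustrate what text-based SFT could look like for depth, the following minimal sketch turns one labeled pixel into a plain question/answer pair. The prompt wording, the `DepthSample` record, and the helpers `make_depth_sft_sample` and `parse_depth_answer` are assumptions made for illustration; the repository's actual data format and prompts may differ.

```python
from dataclasses import dataclass


@dataclass
class DepthSample:
    """One supervised example: an image, a queried pixel, and its metric depth."""
    image_path: str            # hypothetical path, not a file from the repository
    pixel_xy: tuple[int, int]  # (x, y) pixel coordinates of the queried point
    depth_m: float             # ground-truth depth in meters


def make_depth_sft_sample(sample: DepthSample) -> dict:
    """Build a text question/answer pair (hypothetical format).

    The target is ordinary text, so training can rely on the usual
    next-token loss of the VLM instead of a dense head or regression loss.
    """
    x, y = sample.pixel_xy
    prompt = f"How far is the point at pixel ({x}, {y}) from the camera, in meters?"
    answer = f"{sample.depth_m:.2f}"
    return {"image": sample.image_path, "prompt": prompt, "answer": answer}


def parse_depth_answer(text: str) -> float:
    """Convert a text answer back into a depth value in meters."""
    return float(text.strip())


record = make_depth_sft_sample(DepthSample("image.png", (320, 240), 2.5))
```

Because both the input and the target are text, a record like this could be fed to any generic SFT pipeline; the only depth-specific step is parsing the answer back into a number at evaluation time.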

Quick Start & Requirements

  • Installation: Create a Conda environment (conda create -n DepthLM python=3.12) and install dependencies (pip install -r requirements.txt). The code is tested with transformers version 4.51.1.
  • Prerequisites: Python 3.12.
  • Data: Requires images and corresponding camera intrinsics/3D labels. Example data from the iBims1 dataset is available at examples/ibims1. Data curation code is provided for reproduction.
  • Links: Model download (🤗), example data (examples/ibims1).
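As a rough illustration of why camera intrinsics accompany the 3D labels, the sketch below back-projects a pixel with a known depth into a 3D point using the standard pinhole camera model. The function names and the intrinsics values are hypothetical, and whether the repository's labels store z-depth or distance along the viewing ray is not stated here, so treat the convention as an assumption.

```python
import math


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole model: map pixel (u, v) with z-depth depth_m (meters)
    to a 3D point (x, y, z) in the camera frame.

    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def distance_from_camera(point):
    """Euclidean distance from the camera center to a 3D point."""
    return math.sqrt(sum(c * c for c in point))


# Hypothetical intrinsics for a 640x480 image.
point = backproject(u=420, v=240, depth_m=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
dist = distance_from_camera(point)
```

The two helpers make the difference between z-depth and ray distance explicit: for pixels away from the principal point, the distance to the camera is larger than the z-depth.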

Highlighted Details

  • Achieves accuracy comparable to pure vision models on metric depth estimation using standard VLMs.
  • Enables a unified VLM for complex 3D understanding tasks (e.g., speed/time estimation, metric scale camera pose estimation).
  • Requires no architectural changes to the base VLM, such as dense prediction heads or specialized regression/regularization losses.

Licensing & Compatibility

  • License: FAIR CC-BY-NC.
  • Compatibility: The CC-BY-NC license prohibits commercial use and requires attribution. Derivative works are permitted, but only for non-commercial purposes.

Limitations & Caveats

The CC-BY-NC license rules out commercial applications. The curated datasets are not released for legal reasons, so users must prepare their own data with the provided curation code or use the example data.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 33 stars in the last 30 days
