facebookresearch/DepthLM: Vision Language Models for metric depth estimation
Metric Depth from Vision Language Models (VLMs) addresses the challenge of achieving high accuracy in metric depth estimation using standard VLMs without architectural modifications. This approach benefits researchers and engineers by enabling a single, unified VLM to handle diverse 3D understanding tasks, such as speed estimation and metric scale camera pose estimation, which previously required specialized vision models or complex pipelines.
How It Works
DepthLM leverages standard text-based Supervised Fine-Tuning (SFT) on existing Vision Language Models. It demonstrates that VLMs can reach accuracy comparable to pure vision models for metric depth estimation without needing custom components like dense prediction heads or specific regression/regularization losses. This architectural simplicity is key to its versatility across various 3D perception tasks.
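As a concrete illustration of this text-based interface, the sketch below prompts an off-the-shelf VLM through Hugging Face transformers for a depth value. The model id, prompt wording, and image path are illustrative assumptions; this is not the released DepthLM checkpoint or its exact prompt template.

```python
# Minimal sketch: querying metric depth from a chat-style VLM as plain text.
# Assumptions: model id, prompt wording, and image path are illustrative,
# not the DepthLM release.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumption: any instruct VLM works here
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto")

image = Image.open("path/to/image.png")  # hypothetical path to an RGB image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Estimate the metric depth, in meters, of the object at the "
                 "image center. Answer with a single number."},
    ],
}]

# Render the chat template, run generation, and decode only the new tokens.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
answer = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "2.4"
```

Because the answer is ordinary generated text, the same model and interface can be reused for other 3D queries (speed, camera pose) simply by changing the prompt, which is the versatility the approach emphasizes.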
Quick Start & Requirements
Create a conda environment (conda create -n DepthLM python=3.12) and install the dependencies (pip install -r requirements.txt). The code is tested with transformers version 4.51.1. Example data is provided under examples/ibims1, and data curation code is included for reproducing the training datasets.
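Since the code is tested against a specific transformers release, a quick version check after installation can save debugging time; the snippet below is a minimal sanity check, not part of the repository.

```python
# Verify the environment matches the tested dependency version
# (the repo reports testing with transformers 4.51.1).
import transformers

assert transformers.__version__ == "4.51.1", (
    f"untested transformers version: {transformers.__version__}"
)
```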
Maintenance & Community
The repository was last updated about a month ago and is currently flagged as inactive.
Licensing & Compatibility
Released under a CC-BY-NC (non-commercial) license.
Limitations & Caveats
The CC-BY-NC license imposes significant restrictions on commercial use. The curated training datasets are not released for legal reasons, so users must prepare their own data or rely on the provided example data.