CVPR 2024 paper introducing a grounded Vision-Language Model for remote sensing
GeoChat is a novel Large Vision-Language Model (LVLM) specifically designed for remote sensing (RS) applications, addressing the need for high-resolution imagery analysis and region-level reasoning. It targets researchers and practitioners in geospatial AI, offering robust zero-shot capabilities for tasks like image captioning, visual question answering, and grounded conversations within RS contexts.
How It Works
GeoChat builds on the LLaVA-1.5 architecture, fine-tuned on a custom-built RS multimodal instruction dataset of 318k image–instruction pairs. It uses a CLIP ViT-L/14 vision tower and a two-layer MLP projector to align visual features with the Vicuna-1.5 language model's embedding space. The model accepts region-level inputs and task-specific prompts, enabling it to generate text interleaved with object locations for grounded responses.
Quick Start & Requirements
Install with `pip install -e .` after creating a `geochat` conda environment with Python 3.10, then install `ninja` and `flash-attn` (the latter with `--no-build-isolation`). Training requires 3x A100 GPUs with 40GB memory.
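A plausible command sequence for these steps is sketched below; it follows the standard conda-plus-pip workflow described above rather than reproducing the README verbatim, so treat the exact commands and ordering as assumptions:

```bash
# Create and activate a Python 3.10 environment (name taken from the summary above)
conda create -n geochat python=3.10 -y
conda activate geochat

# Install the repository in editable mode from its root directory
pip install --upgrade pip
pip install -e .

# Build helpers for efficient attention kernels
pip install ninja
pip install flash-attn --no-build-isolation
```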
Highlighted Details
Maintenance & Community
The project is associated with Mohamed bin Zayed University of AI, Australian National University, and Linköping University. Updates are announced via GitHub.
Licensing & Compatibility
The repository is open-sourced, with code, model, dataset, and evaluation scripts released. Specific license details are not explicitly stated in the README, but acknowledgments mention LLaVA and Vicuna, suggesting potential compatibility with their licenses.
Limitations & Caveats
The README indicates that training on fewer GPUs requires adjusting the per-device batch size and gradient accumulation steps so that the effective global batch size is preserved. The model's performance is evaluated using greedy decoding, which may differ from beam search outputs.
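For intuition, the global batch size is the product of the per-device batch size, the gradient accumulation steps, and the GPU count, so dropping from 3 GPUs to 1 means tripling accumulation to keep it constant. The sketch below uses Hugging Face Trainer-style flags with a placeholder launcher and script name; GeoChat's actual launch command and batch sizes may differ.

```bash
# Placeholder launcher and script; substitute the repository's real training entry point.
# Global batch = per_device_train_batch_size x gradient_accumulation_steps x num_gpus.

# 3x A100 40GB (as in the README): 4 x 4 x 3 = 48
torchrun --nproc_per_node=3 train.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4

# Single GPU: triple the accumulation steps to keep the global batch at 48 (4 x 12 x 1)
torchrun --nproc_per_node=1 train.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 12
```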