GeoChat by mbzuai-oryx

CVPR 2024 paper introducing a grounded vision-language model (VLM) for remote sensing

created 1 year ago
607 stars

Top 54.7% on sourcepulse

Project Summary

GeoChat is a novel Large Vision-Language Model (LVLM) specifically designed for remote sensing (RS) applications, addressing the need for high-resolution imagery analysis and region-level reasoning. It targets researchers and practitioners in geospatial AI, offering robust zero-shot capabilities for tasks like image captioning, visual question answering, and grounded conversations within RS contexts.

How It Works

GeoChat builds on the LLaVA-1.5 architecture and is fine-tuned on a custom-built RS multimodal instruction dataset of 318k instruction pairs. A CLIP ViT-L/14 vision tower feeds a two-layer MLP projector that aligns visual features with the embedding space of the Vicuna-1.5 language model. The model accepts region-level inputs and task-specific prompts, and it can generate text interleaved with object locations to produce grounded responses.
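
As an illustration of the alignment step, here is a minimal sketch of a LLaVA-1.5-style two-layer MLP projector. The hidden sizes (1024 for CLIP ViT-L/14 patch features, 4096 for Vicuna-7B token embeddings) are typical values and an assumption, not read from the repository's configuration.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP that maps CLIP patch features into the LLM embedding space.
    Dimensions are assumptions: CLIP ViT-L/14 emits 1024-d patch tokens and
    Vicuna-7B uses 4096-d token embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the frozen CLIP tower
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Example: project 576 patch tokens (a 24x24 grid) for one image
projector = VisionProjector()
dummy_patches = torch.randn(1, 576, 1024)
visual_tokens = projector(dummy_patches)  # ready to interleave with text embeddings
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```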

Quick Start & Requirements

  • Install: Create a geochat conda environment with Python 3.10, clone the repository, and install with pip install -e . (a minimal inference sketch follows this list).
  • Prerequisites: Python 3.10, ninja, and flash-attn (installed with --no-build-isolation). Training requires 3× A100 40GB GPUs.
  • Resources: Training GeoChat-7B takes ~25 hours on 3× A100 GPUs.
  • Links: Supplementary Material, arXiv, Model Zoo, LoRA Instructions, Evaluation.
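
Because GeoChat inherits the LLaVA-1.5 codebase, loading and querying a checkpoint likely follows LLaVA's conventions. The sketch below is a hedged example: the module paths (geochat.model.builder, geochat.mm_utils), helper names, checkpoint id, and prompt template are all assumptions, so consult the repository's demo and evaluation scripts for the actual entry points.

```python
# Minimal inference sketch -- module paths, function names, and arguments are
# ASSUMPTIONS based on GeoChat inheriting the LLaVA-1.5 codebase; check the
# repository's demo/eval scripts for the real API.
import torch
from PIL import Image
from geochat.model.builder import load_pretrained_model        # assumed LLaVA-style loader
from geochat.mm_utils import tokenizer_image_token, process_images  # assumed helpers

tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path="MBZUAI/geochat-7B",   # assumed checkpoint id
    model_base=None,
    model_name="geochat-7B",
)

image = Image.open("scene.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)

prompt = "USER: <image>\nClassify the scene and ground the visible airplanes. ASSISTANT:"
input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt")

with torch.inference_mode():
    output_ids = model.generate(
        input_ids.unsqueeze(0).to(model.device),
        images=image_tensor.to(model.device, dtype=torch.float16),
        do_sample=False,        # greedy decoding, matching the reported evaluation
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```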

Highlighted Details

  • First grounded LVLM for remote sensing, offering region-level reasoning.
  • Handles image captioning, VQA, scene classification, and referring object detection in a zero-shot setting.
  • Introduces a novel RS multimodal instruction dataset and evaluation benchmarks.
  • Supports multi-turn conversations and grounds responses with object locations (see the parsing sketch below).
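
To make the grounded output concrete, the following sketch extracts interleaved object locations from a response string. The bracketed {<x1><y1><x2><y2>} coordinate scheme is purely hypothetical; the actual token format is defined by GeoChat's instruction dataset and is not documented in this summary.

```python
import re

# Hypothetical grounded response: the {<x1><y1><x2><y2>} scheme below is an
# illustrative placeholder, NOT GeoChat's actual output format.
response = (
    "There are two airplanes parked on the apron "
    "{<12><34><28><51>} and {<60><22><75><40>}."
)

# Pull out each bracketed location as a list of integer coordinates.
boxes = [
    [int(v) for v in re.findall(r"<(\d+)>", group)]
    for group in re.findall(r"\{(?:<\d+>)+\}", response)
]
# Strip the location tokens to recover plain text.
text_only = re.sub(r"\s*\{(?:<\d+>)+\}", "", response)

print(text_only)  # caption without location tokens
print(boxes)      # [[12, 34, 28, 51], [60, 22, 75, 40]]
```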

Maintenance & Community

The project is associated with Mohamed bin Zayed University of AI, Australian National University, and Linköping University. Updates are announced via GitHub.

Licensing & Compatibility

The code, model, dataset, and evaluation scripts are released, but the README does not explicitly state a license. Acknowledgments mention LLaVA and Vicuna, whose license terms may apply to derived components.

Limitations & Caveats

The README indicates that training on fewer GPUs requires adjusting the per-device batch size and gradient accumulation steps accordingly (a worked example follows below). Reported results use greedy decoding, which may differ from beam-search outputs.
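
To make the adjustment concrete, the effective (global) batch size is num_gpus × per_device_batch_size × gradient_accumulation_steps, and the sketch below keeps it constant when dropping from 3 GPUs to 1. The per-device numbers are hypothetical placeholders, not GeoChat's published training hyperparameters.

```python
# Keep the effective (global) batch size constant when moving to fewer GPUs.
# The reference numbers here are hypothetical, not GeoChat's actual settings.
ref_gpus, ref_per_device_batch, ref_grad_accum = 3, 16, 1
effective_batch = ref_gpus * ref_per_device_batch * ref_grad_accum  # 48

new_gpus, new_per_device_batch = 1, 8   # e.g. a smaller GPU holds fewer samples
new_grad_accum = effective_batch // (new_gpus * new_per_device_batch)

print(effective_batch, new_grad_accum)  # 48, 6
```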

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 55 stars in the last 90 days
