CVPR 2024 paper introducing a grounded Vision-Language Model for remote sensing
GeoChat is a novel Large Vision-Language Model (LVLM) specifically designed for remote sensing (RS) applications, addressing the need for high-resolution imagery analysis and region-level reasoning. It targets researchers and practitioners in geospatial AI, offering robust zero-shot capabilities for tasks like image captioning, visual question answering, and grounded conversations within RS contexts.
How It Works
GeoChat builds on the LLaVA-1.5 architecture, fine-tuned on a custom-built RS multimodal instruction dataset of 318k image–instruction pairs. It uses a CLIP ViT-L/14 vision tower and a two-layer MLP projector to align visual features with the Vicuna-1.5 language model's embedding space. The model accepts region-level inputs and task-specific prompts, enabling it to generate text interleaved with object locations for grounded responses.
Quick Start & Requirements
Install with `pip install -e .` after creating a `geochat` conda environment with Python 3.10, then install `ninja` and `flash-attn` (the latter with `--no-build-isolation`). Training requires 3x A100 GPUs with 40GB memory.
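A plausible command sequence for these steps is sketched below; it follows the standard conda-plus-pip workflow described above rather than reproducing the README verbatim, so treat the exact commands and ordering as assumptions:

```bash
# Create and activate a Python 3.10 environment (name taken from the summary above)
conda create -n geochat python=3.10 -y
conda activate geochat

# Install the repository in editable mode from its root directory
pip install --upgrade pip
pip install -e .

# Build helpers for efficient attention kernels
pip install ninja
pip install flash-attn --no-build-isolation
```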
Highlighted Details
Maintenance & Community
The project is associated with Mohamed bin Zayed University of AI, Australian National University, and Linköping University. Updates are announced via GitHub.
Licensing & Compatibility
The repository is open-sourced, with code, model, dataset, and evaluation scripts released. Specific license details are not explicitly stated in the README, but acknowledgments mention LLaVA and Vicuna, suggesting potential compatibility with their licenses.
Limitations & Caveats
The README indicates that training on fewer GPUs requires adjusting the per-device batch size and gradient accumulation steps so that the effective global batch size is preserved. The model's performance is evaluated using greedy decoding, which may differ from beam search outputs.
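For intuition, the global batch size is the product of the per-device batch size, the gradient accumulation steps, and the GPU count, so dropping from 3 GPUs to 1 means tripling accumulation to keep it constant. The sketch below uses Hugging Face Trainer-style flags with a placeholder launcher and script name; GeoChat's actual launch command and batch sizes may differ.

```bash
# Placeholder launcher and script; substitute the repository's real training entry point.
# Global batch = per_device_train_batch_size x gradient_accumulation_steps x num_gpus.

# 3x A100 40GB (as in the README): 4 x 4 x 3 = 48
torchrun --nproc_per_node=3 train.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4

# Single GPU: triple the accumulation steps to keep the global batch at 48 (4 x 12 x 1)
torchrun --nproc_per_node=1 train.py \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 12
```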