RS5M by om-ai-lab

Vision-language dataset and model for remote sensing

created 2 years ago
270 stars

Top 95.9% on sourcepulse

Project Summary

This repository provides the RS5M dataset and GeoRSCLIP, a vision-language foundation model tailored for remote sensing (RS) applications. It addresses the challenge of adapting general vision-language models (VLMs) to the specialized domain of remote sensing, enabling improved performance on downstream tasks like zero-shot classification, cross-modal retrieval, and semantic localization. The target audience includes researchers and practitioners in remote sensing, computer vision, and natural language processing.

How It Works

The project introduces RS5M, a 5-million image-text pair dataset for remote sensing, created by filtering existing datasets and using VLMs for captioning. GeoRSCLIP is a domain-adapted VLM, fine-tuned on RS5M using parameter-efficient fine-tuning (PEFT) methods. This approach bridges the gap between general VLMs and domain-specific tasks, offering improved transfer learning capabilities.
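
The README does not spell out the exact PEFT recipe, so the sketch below only illustrates the general pattern: start from a pre-trained CLIP checkpoint, freeze most weights, and train a small parameter subset with the standard contrastive loss. The ViT-B-32 backbone, the choice of which blocks to unfreeze, and the dummy batch are illustrative assumptions, not the authors' configuration.

```python
# Minimal PEFT-style fine-tuning sketch for a CLIP model on image-text pairs.
# Assumptions: the open_clip backbone name, the unfreezing policy, and the
# dummy batch are placeholders, not the setup used to train GeoRSCLIP.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device)

# Parameter-efficient setup: freeze everything, then re-enable a small subset
# (here, only the last transformer block of each tower).
for p in model.parameters():
    p.requires_grad = False
for p in model.visual.transformer.resblocks[-1].parameters():
    p.requires_grad = True
for p in model.transformer.resblocks[-1].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)

# One training step on a dummy batch of image-text pairs (replace with RS5M data).
images = torch.randn(4, 3, 224, 224, device=device)
texts = tokenizer(["satellite image of an airport"] * 4).to(device)

image_features = F.normalize(model.encode_image(images), dim=-1)
text_features = F.normalize(model.encode_text(texts), dim=-1)
logits = model.logit_scale.exp() * image_features @ text_features.t()
labels = torch.arange(len(images), device=device)

# Symmetric InfoNCE loss, as in standard CLIP training.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
loss.backward()
optimizer.step()
```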

Quick Start & Requirements

  • Installation: Clone the repository from Hugging Face (git clone https://huggingface.co/Zilun/GeoRSCLIP). Install PyTorch (tested with 2.0.1/CUDA 11.8 and 2.1.0/CUDA 12.1) and other dependencies via pip.
  • Prerequisites: PyTorch with CUDA support, Pillow, pandas, scikit-learn, ftfy, tqdm, matplotlib, transformers, adapter-transformers, open_clip_torch, pycocotools, timm, clip-benchmark, torch-rs.
  • Usage: Unzip the test data and run inference with python codebase/inference.py --ckpt-path <path_to_model> --test-dataset-dir <path_to_data>. Model checkpoints and dataset links are available on Hugging Face; a minimal zero-shot loading sketch follows this list.
  • Dataset: The RS5M dataset is ~500 GB and is available in WebDataset format or as raw image files.
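
For a quick test outside the bundled inference script, a GeoRSCLIP checkpoint can also be loaded through open_clip for zero-shot classification. This is only a sketch: the backbone name, checkpoint path, image path, and class prompts are placeholders, so consult the Hugging Face model card for the actual values.

```python
# Hedged sketch: zero-shot scene classification with a GeoRSCLIP checkpoint
# loaded through open_clip. Backbone name, checkpoint path, image path, and
# prompt list are placeholders, not values taken from the repository.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumes the downloaded checkpoint is a ViT-B-32 state dict; key names may
# need adjusting depending on how the checkpoint was saved.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
ckpt = torch.load("path/to/georsclip_checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
model.load_state_dict(state_dict, strict=False)

tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

classes = ["airport", "forest", "harbor", "residential area"]  # example labels
prompts = tokenizer([f"a satellite image of a {c}" for c in classes]).to(device)
image = preprocess(Image.open("example_scene.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.t()).softmax(dim=-1)

for c, p in zip(classes, probs[0].tolist()):
    print(f"{c}: {p:.3f}")
```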

Highlighted Details

  • GeoRSCLIP improves over baseline models by 3%–20% in Zero-shot Classification, 3%–6% in Remote Sensing Cross-Modal Text–Image Retrieval, and 4%–5% in Semantic Localization.
  • The RS5M dataset is the first large-scale image-text paired dataset specifically for remote sensing.
  • Offers pre-trained models for both CLIP-like (GeoRSCLIP) and Stable Diffusion (GeoRSSD) architectures adapted for remote sensing.
  • A data-loading throughput of ~1800 images/sec is reported on specific hardware; a WebDataset loading sketch follows this list.
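
Because the dataset ships in WebDataset format, a typical loading pipeline streams the tar shards with the webdataset package and a PyTorch DataLoader; actual throughput depends on storage, CPU, and worker count. The shard pattern and sample keys below are assumptions rather than values from the dataset card.

```python
# Hedged sketch of streaming RS5M shards with the webdataset package.
# The shard pattern ("rs5m-{000000..000099}.tar") and the sample keys
# ("jpg", "txt") are placeholders; check the dataset card for the real names.
import torch
import webdataset as wds
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset("rs5m-{000000..000099}.tar", shardshuffle=True)
    .shuffle(1000)                 # in-memory sample shuffling
    .decode("pil")                 # decode images to PIL
    .to_tuple("jpg", "txt")        # (image, caption) pairs
    .map_tuple(preprocess, lambda caption: caption)
)

loader = torch.utils.data.DataLoader(
    dataset.batched(64), batch_size=None, num_workers=8
)

for images, captions in loader:
    # images: (batch, 3, 224, 224) tensor; captions: list of strings
    break
```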

Maintenance & Community

  • The project is associated with the om-ai-lab.
  • Contact email: zilun.zhang@zju.edu.cn.
  • A Slack group is available for community interaction.
  • Links to related projects and papers are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • The RS5M dataset is available via Hugging Face and Baidu Disk.
  • Pre-trained models are hosted on Hugging Face.

Limitations & Caveats

  • The README does not specify a license for the code or models, which may impact commercial use.
  • The dataset is large (~500GB), requiring substantial storage and bandwidth.
  • Reported benchmark results and data-loading throughput were measured on specific hardware configurations, so performance may vary on other setups.
Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 15 stars in the last 90 days
