RS5M by om-ai-lab

Vision-language dataset and model for remote sensing

Created 2 years ago
281 stars

Top 92.7% on SourcePulse

View on GitHub
Project Summary

This repository provides the RS5M dataset and GeoRSCLIP, a vision-language foundation model tailored for remote sensing (RS) applications. It addresses the challenge of adapting general vision-language models (VLMs) to the specialized domain of remote sensing, enabling improved performance on downstream tasks like zero-shot classification, cross-modal retrieval, and semantic localization. The target audience includes researchers and practitioners in remote sensing, computer vision, and natural language processing.

How It Works

The project introduces RS5M, a 5-million image-text pair dataset for remote sensing, created by filtering existing datasets and using VLMs for captioning. GeoRSCLIP is a domain-adapted VLM, fine-tuned on RS5M using parameter-efficient fine-tuning (PEFT) methods. This approach bridges the gap between general VLMs and domain-specific tasks, offering improved transfer learning capabilities.
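
This summary does not spell out the exact PEFT recipe, but the general pattern of freezing a CLIP backbone and training only a small parameter subset looks roughly like the sketch below. It is a minimal illustration with PyTorch and open_clip; the backbone name, the choice of trainable parameters, and the optimizer settings are assumptions, not the authors' configuration.

    import torch
    import open_clip

    # Backbone name is an assumption; GeoRSCLIP's actual architecture may differ.
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")

    # Parameter-efficient fine-tuning: freeze everything, then unfreeze a small subset.
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        # Illustrative choice (bias and LayerNorm terms), not the repository's PEFT config.
        if "bias" in name or "ln_" in name:
            p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)
    # Training would then minimize the standard CLIP contrastive loss on RS5M image-text batches.

Only the unfrozen parameters receive gradient updates, which keeps memory and compute costs far below full fine-tuning while adapting the model to the remote sensing domain.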

Quick Start & Requirements

  • Installation: Clone the repository from Hugging Face (git clone https://huggingface.co/Zilun/GeoRSCLIP). Install PyTorch (tested with 2.0.1/CUDA 11.8 and 2.1.0/CUDA 12.1) and other dependencies via pip.
  • Prerequisites: PyTorch with CUDA support, Pillow, pandas, scikit-learn, ftfy, tqdm, matplotlib, transformers, adapter-transformers, open_clip_torch, pycocotools, timm, clip-benchmark, torch-rs.
  • Usage: Unzip the test data and run inference with python codebase/inference.py --ckpt-path <path_to_model> --test-dataset-dir <path_to_data>; a minimal open_clip-based sketch follows this list. Model checkpoints and dataset links are available on Hugging Face.
  • Dataset: RS5M is ~500 GB and is available in WebDataset format or as raw image files.
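
For reference, here is a minimal zero-shot inference sketch built on open_clip (listed among the prerequisites). The architecture name, checkpoint filename, image path, and prompts are placeholders to be swapped for the actual GeoRSCLIP release.

    import torch
    import open_clip
    from PIL import Image

    # Architecture and checkpoint path are assumptions; use the files from the GeoRSCLIP release.
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
    ckpt = torch.load("GeoRSCLIP/ckpt/RS5M_ViT-B-32.pt", map_location="cpu")
    state = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    model.load_state_dict(state, strict=False)
    model.eval()

    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.open("example_scene.jpg")).unsqueeze(0)   # hypothetical image path
    texts = tokenizer(["an aerial image of an airport", "a satellite image of farmland"])

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(texts)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    print(probs)  # similarity of the image to each text prompt

The same image-text similarity scores underpin zero-shot classification and cross-modal retrieval, the tasks highlighted in the results below.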

Highlighted Details

  • GeoRSCLIP improves over baselines by 3%-20% on zero-shot classification, 3%-6% on remote sensing cross-modal text-image retrieval, and 4%-5% on semantic localization.
  • The RS5M dataset is the first large-scale image-text paired dataset specifically for remote sensing.
  • Offers pre-trained models for both CLIP-like (GeoRSCLIP) and Stable Diffusion (GeoRSSD) architectures adapted for remote sensing.
  • A data loading throughput of ~1800 images/sec is reported on specific hardware (a WebDataset loading sketch follows this list).
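
Because the dataset ships in WebDataset format, shards can be streamed rather than unpacked first. Below is a rough loading sketch with the webdataset library; the shard pattern and sample keys are placeholders, not the actual RS5M layout.

    import webdataset as wds

    # Shard pattern and sample keys are placeholders; check the RS5M release for the real layout.
    shards = "rs5m-{000000..000099}.tar"
    dataset = (
        wds.WebDataset(shards)
        .decode("pil")               # decode image bytes into PIL images
        .to_tuple("jpg", "txt")      # yield (image, caption) pairs, assuming these keys
    )

    for image, caption in dataset:
        # image is a PIL.Image, caption is a str; hand off to preprocessing / the model here
        break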

Maintenance & Community

  • The project is maintained under the om-ai-lab organization.
  • Contact email: zilun.zhang@zju.edu.cn.
  • A Slack group is available for community interaction.
  • Links to related projects and papers are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • The RS5M dataset is available via Hugging Face and Baidu Disk.
  • Pre-trained models are hosted on Hugging Face.

Limitations & Caveats

  • The README does not specify a license for the code or models, which may impact commercial use.
  • The dataset is large (~500GB), requiring substantial storage and bandwidth.
  • Specific hardware configurations are mentioned for testing, implying potential performance variations on different setups.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

  • open_flamingo by mlfoundations: Open-source framework for training large multimodal models. Top 0.1% on SourcePulse, 4k stars, created 2 years ago, updated 1 year ago.

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (coauthor of Django), and 10 more.

  • LAVIS by salesforce: Library for language-vision AI research. Top 0.2% on SourcePulse, 11k stars, created 3 years ago, updated 10 months ago.