RS5M by om-ai-lab

Vision-language dataset and model for remote sensing

Created 2 years ago
281 stars

Top 92.7% on SourcePulse

View on GitHub
Project Summary

This repository provides the RS5M dataset and GeoRSCLIP, a vision-language foundation model tailored for remote sensing (RS) applications. It addresses the challenge of adapting general vision-language models (VLMs) to the specialized domain of remote sensing, enabling improved performance on downstream tasks like zero-shot classification, cross-modal retrieval, and semantic localization. The target audience includes researchers and practitioners in remote sensing, computer vision, and natural language processing.

How It Works

The project introduces RS5M, a 5-million image-text pair dataset for remote sensing, created by filtering existing datasets and using VLMs for captioning. GeoRSCLIP is a domain-adapted VLM, fine-tuned on RS5M using parameter-efficient fine-tuning (PEFT) methods. This approach bridges the gap between general VLMs and domain-specific tasks, offering improved transfer learning capabilities.
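
This summary does not spell out the exact PEFT recipe, but the general pattern of freezing a CLIP backbone and training only a small parameter subset looks roughly like the sketch below. It is a minimal illustration with PyTorch and open_clip; the backbone name, the choice of trainable parameters, and the optimizer settings are assumptions, not the authors' configuration.

    import torch
    import open_clip

    # Backbone name is an assumption; GeoRSCLIP's actual architecture may differ.
    model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")

    # Parameter-efficient fine-tuning: freeze everything, then unfreeze a small subset.
    for p in model.parameters():
        p.requires_grad = False
    for name, p in model.named_parameters():
        # Illustrative choice (bias and LayerNorm terms), not the repository's PEFT config.
        if "bias" in name or "ln_" in name:
            p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-5)
    # Training would then minimize the standard CLIP contrastive loss on RS5M image-text batches.

Only the unfrozen parameters receive gradient updates, which keeps memory and compute costs far below full fine-tuning while adapting the model to the remote sensing domain.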

Quick Start & Requirements

  • Installation: Clone the repository from Hugging Face (git clone https://huggingface.co/Zilun/GeoRSCLIP). Install PyTorch (tested with 2.0.1/CUDA 11.8 and 2.1.0/CUDA 12.1) and other dependencies via pip.
  • Prerequisites: PyTorch with CUDA support, Pillow, pandas, scikit-learn, ftfy, tqdm, matplotlib, transformers, adapter-transformers, open_clip_torch, pycocotools, timm, clip-benchmark, torch-rs.
  • Usage: Unzip the test data and run inference with python codebase/inference.py --ckpt-path <path_to_model> --test-dataset-dir <path_to_data>; a minimal open_clip-based sketch follows this list. Model checkpoints and dataset links are available on Hugging Face.
  • Dataset: RS5M is ~500 GB and is available in WebDataset format or as raw image files.
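
For reference, here is a minimal zero-shot inference sketch built on open_clip (listed among the prerequisites). The architecture name, checkpoint filename, image path, and prompts are placeholders to be swapped for the actual GeoRSCLIP release.

    import torch
    import open_clip
    from PIL import Image

    # Architecture and checkpoint path are assumptions; use the files from the GeoRSCLIP release.
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
    ckpt = torch.load("GeoRSCLIP/ckpt/RS5M_ViT-B-32.pt", map_location="cpu")
    state = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
    model.load_state_dict(state, strict=False)
    model.eval()

    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    image = preprocess(Image.open("example_scene.jpg")).unsqueeze(0)   # hypothetical image path
    texts = tokenizer(["an aerial image of an airport", "a satellite image of farmland"])

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(texts)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

    print(probs)  # similarity of the image to each text prompt

The same image-text similarity scores underpin zero-shot classification and cross-modal retrieval, the tasks highlighted in the results below.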

Highlighted Details

  • GeoRSCLIP improves over baselines by 3%-20% on zero-shot classification, 3%-6% on remote sensing cross-modal text-image retrieval, and 4%-5% on semantic localization.
  • The RS5M dataset is the first large-scale image-text paired dataset specifically for remote sensing.
  • Offers pre-trained models for both CLIP-like (GeoRSCLIP) and Stable Diffusion (GeoRSSD) architectures adapted for remote sensing.
  • A data loading throughput of ~1800 images/sec is reported on specific hardware (a WebDataset loading sketch follows this list).
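
Because the dataset ships in WebDataset format, shards can be streamed rather than unpacked first. Below is a rough loading sketch with the webdataset library; the shard pattern and sample keys are placeholders, not the actual RS5M layout.

    import webdataset as wds

    # Shard pattern and sample keys are placeholders; check the RS5M release for the real layout.
    shards = "rs5m-{000000..000099}.tar"
    dataset = (
        wds.WebDataset(shards)
        .decode("pil")               # decode image bytes into PIL images
        .to_tuple("jpg", "txt")      # yield (image, caption) pairs, assuming these keys
    )

    for image, caption in dataset:
        # image is a PIL.Image, caption is a str; hand off to preprocessing / the model here
        break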

Maintenance & Community

  • The project is maintained under the om-ai-lab organization.
  • Contact email: zilun.zhang@zju.edu.cn.
  • A Slack group is available for community interaction.
  • Links to related projects and papers are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • The RS5M dataset is available via Hugging Face and Baidu Disk.
  • Pre-trained models are hosted on Hugging Face.

Limitations & Caveats

  • The README does not specify a license for the code or models, which may impact commercial use.
  • The dataset is large (~500GB), requiring substantial storage and bandwidth.
  • Specific hardware configurations are mentioned for testing, implying potential performance variations on different setups.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

  • open_flamingo by mlfoundations: Open-source framework for training large multimodal models. Top 0.1% on SourcePulse, 4k stars, created 2 years ago, updated 1 year ago.

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (coauthor of Django), and 10 more.

  • LAVIS by salesforce: Library for language-vision AI research. Top 0.2% on SourcePulse, 11k stars, created 3 years ago, updated 10 months ago.