RADIO by NVlabs

Agglomerative vision foundation model distilled from multiple large teacher models

created 1 year ago
1,293 stars

Top 30.8% on SourcePulse

Project Summary

NVlabs/RADIO provides an Agglomerative Vision Foundation Model (AM-RADIO) designed to distill multiple vision foundation models into a single, versatile backbone. It aims to serve as a superior replacement for traditional vision backbones across various domains, offering strong performance in image classification, segmentation, and vision-language tasks. The framework is suitable for researchers and developers seeking a unified and high-performing vision model.

How It Works

RADIO integrates diverse vision foundation models like CLIP variants, DINOv2, and SAM through a distillation process. This agglomerative approach allows it to preserve and combine unique features from its teachers, such as text grounding and segmentation correspondence. The model architecture is based on Vision Transformers (ViTs) and supports arbitrary input resolutions, including non-square images, with an efficient variant (E-RADIO) offering significant speedups.
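
The following is a minimal conceptual sketch of the multi-teacher (agglomerative) distillation idea described above: a single student backbone is trained to match the summary and spatial features of several frozen teachers through per-teacher projection heads. The teacher set, head design, and loss weighting here are illustrative assumptions, not the exact AM-RADIO training recipe.

    # Conceptual sketch of agglomerative (multi-teacher) feature distillation.
    # The projection heads and loss weights are illustrative placeholders,
    # not the paper's exact training objective.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DistillationHead(nn.Module):
        """Projects student features into one teacher's feature space."""
        def __init__(self, student_dim: int, teacher_dim: int):
            super().__init__()
            self.summary_proj = nn.Linear(student_dim, teacher_dim)
            self.spatial_proj = nn.Linear(student_dim, teacher_dim)

        def forward(self, summary, spatial):
            return self.summary_proj(summary), self.spatial_proj(spatial)

    def feature_matching_loss(student_out, teacher_out):
        """Cosine loss on pooled summaries, MSE on dense spatial features (assumed weighting)."""
        s_summary, s_spatial = student_out
        t_summary, t_spatial = teacher_out
        summary_loss = 1.0 - F.cosine_similarity(s_summary, t_summary, dim=-1).mean()
        spatial_loss = F.mse_loss(s_spatial, t_spatial)
        return summary_loss + spatial_loss

    def training_step(student, heads, teachers, images):
        """One step: the single student backbone mimics every frozen teacher."""
        summary, spatial = student(images)            # shared ViT backbone outputs
        total = torch.zeros((), device=images.device)
        for name, teacher in teachers.items():        # e.g. CLIP, DINOv2, SAM
            with torch.no_grad():
                teacher_out = teacher(images)         # frozen teacher features
            total = total + feature_matching_loss(heads[name](summary, spatial), teacher_out)
        return total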

Quick Start & Requirements

  • TorchHub: torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-h', progress=True) (see the inference sketch after this list)
  • HuggingFace: AutoModel.from_pretrained("nvidia/RADIO", trust_remote_code=True)
  • Prerequisites: PyTorch, CUDA (for GPU acceleration), Pillow, Transformers. Mixed precision (bfloat16) is supported.
  • Input: Images should be in the range [0, 1].
  • Docs: TorchHub Usage, HuggingFace Usage
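
Below is a minimal end-to-end inference sketch built on the TorchHub entry point above. The image path is a placeholder, and helper calls such as get_nearest_supported_resolution follow the repository's usage examples; check the current docs for exact signatures.

    # Minimal inference sketch via TorchHub. 'example.jpg' is a placeholder path;
    # get_nearest_supported_resolution follows the upstream usage examples and
    # should be verified against the current docs.
    import torch
    import torch.nn.functional as F
    from PIL import Image
    from torchvision.transforms.functional import pil_to_tensor

    model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                           version='radio_v2.5-h', progress=True)
    model.cuda().eval()

    img = Image.open('example.jpg').convert('RGB')           # placeholder image
    x = pil_to_tensor(img).to(dtype=torch.float32, device='cuda')
    x = x.unsqueeze(0) / 255.0                                # RADIO expects [0, 1]

    # Snap to a supported resolution (non-square inputs are allowed).
    h, w = model.get_nearest_supported_resolution(*x.shape[-2:])
    x = F.interpolate(x, (h, w), mode='bilinear', align_corners=False)

    with torch.autocast('cuda', dtype=torch.bfloat16), torch.no_grad():
        summary, spatial_features = model(x)                  # pooled + dense features

    print(summary.shape, spatial_features.shape)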

Highlighted Details

  • Achieves state-of-the-art performance, outperforming teachers in ImageNet zero-shot (+6.8%), kNN (+2.39%), and linear probing segmentation (+3.8%).
  • Demonstrates strong vision-language model capabilities, improving LLaVa 1.5 performance by up to 1.5%.
  • Supports flexible input resolutions, including non-square images, with specific models having preferred or maximum resolution constraints.
  • Offers adaptors for specific teacher behaviors (e.g., clip, siglip, dino_v2, sam) and allows fetching intermediate layer activations; adaptor usage is sketched below.
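
The sketch below shows requesting teacher-specific adaptors at load time. The adaptor_names argument and the dict-style output keyed by adaptor name mirror the repository's examples, but the exact output layout is an assumption to verify against the docs.

    # Sketch of loading RADIO with teacher-specific adaptors. The adaptor_names
    # argument and the per-adaptor dict output follow the upstream examples but
    # should be verified against the current docs.
    import torch

    model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                           version='radio_v2.5-h', progress=True,
                           adaptor_names=['clip', 'dino_v2'])
    model.cuda().eval()

    x = torch.rand(1, 3, 512, 512, device='cuda')   # dummy image in [0, 1]
    with torch.no_grad():
        output = model(x)

    # With adaptors enabled, the output is assumed to be keyed per head: the
    # shared backbone features plus one entry per requested teacher adaptor.
    backbone_summary, backbone_features = output['backbone']
    clip_summary, clip_features = output['clip']
    dino_summary, dino_features = output['dino_v2']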

Maintenance & Community

The project is actively developed by NVIDIA Research. Multiple model versions have been released, including RADIOv2.5 and C-RADIO (licensed for commercial use), alongside related research papers such as PHI-S and FeatSharp.

Licensing & Compatibility

The primary license is the NVIDIA Source Code License-NC, which restricts commercial use. However, the C-RADIO model is released under the NVIDIA Open Model License Agreement, permitting commercial products.

Limitations & Caveats

E-RADIO performs best when input height and width are divisible by 32, and its efficiency depends on setting the attention window size correctly for the input resolution (see the sketch below). Older versions (e.g., RADIOv2.1) exhibited mode-switching issues at higher resolutions; these were addressed in RADIOv2.5.
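
A short sketch of working within these E-RADIO constraints, assuming the set_optimal_window_size helper shown in the repository's examples; the version string and method path should be verified against the current release.

    # E-RADIO caveats: keep dimensions divisible by 32 and set the attention
    # window size before inference. The version string and the
    # set_optimal_window_size call follow the upstream example; verify against
    # the current release.
    import torch

    model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                           version='e-radio_v2', progress=True)
    model.cuda().eval()

    x = torch.rand(1, 3, 512, 768, device='cuda')    # dims already divisible by 32

    # E-RADIO's efficiency depends on a window size matched to the input shape.
    model.model.set_optimal_window_size(x.shape[2:])

    with torch.no_grad():
        summary, spatial_features = model(x)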

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 159 stars in the last 90 days
