scaling_on_scales by bfshi

PyTorch wrapper for multi-scale vision feature extraction

created 1 year ago
404 stars

Top 73.1% on SourcePulse

Project Summary

This repository provides S²-Wrapper, a PyTorch wrapper for extracting multi-scale features from any vision model. It lets a model gain performance from higher image resolution instead of relying solely on a larger model size. It targets researchers and developers working with vision models, particularly in multimodal contexts.

How It Works

S²-Wrapper wraps a given vision model's forward pass. It resizes the input image to each specified scale, optionally splits the larger versions into smaller sub-images, feeds them through the model, and concatenates the resulting features. The model thereby processes information at multiple resolutions, capturing both finer details and broader context, without any architectural change to the base model.
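The data flow described above can be illustrated with a small pure-Python toy. This is only a sketch under stated assumptions, not the repository's implementation: the real wrapper operates on PyTorch tensors, and every name here (`multiscale_features`, `toy_model`, `block_reduce`, and so on) is hypothetical. Nearest-neighbour upsampling, a max-per-patch "model", and average pooling to align the feature maps before concatenation are illustrative choices made for this example.

```python
from typing import Callable, List

Grid = List[List[float]]  # square single-channel image or feature map


def upsample_nearest(img: Grid, factor: int) -> Grid:
    """Enlarge a square grid by an integer factor (nearest neighbour)."""
    size = len(img) * factor
    return [[img[r // factor][c // factor] for c in range(size)]
            for r in range(size)]


def block_reduce(grid: Grid, k: int, reduce: Callable) -> Grid:
    """Collapse each k x k block of a square grid to one value."""
    n = len(grid) // k
    return [[reduce([grid[i * k + a][j * k + b]
                     for a in range(k) for b in range(k)])
             for j in range(n)]
            for i in range(n)]


def mean(values):
    return sum(values) / len(values)


def toy_model(tile: Grid) -> Grid:
    """Hypothetical stand-in backbone: one feature per 2 x 2 patch (its max)."""
    return block_reduce(tile, 2, max)


def multiscale_features(img: Grid, model: Callable[[Grid], Grid],
                        scales: List[int]) -> List[List[List[float]]]:
    base = len(img)
    per_scale = []
    for s in scales:
        big = upsample_nearest(img, s)  # resize to s * base pixels per side
        # Split into s x s tiles of the base size and run the model on each.
        tiles = [[model([row[tc * base:(tc + 1) * base]
                         for row in big[tr * base:(tr + 1) * base]])
                  for tc in range(s)]
                 for tr in range(s)]
        # Stitch the per-tile feature maps back into one large map.
        f = len(tiles[0][0])
        merged = [[v for tc in range(s) for v in tiles[tr][tc][r]]
                  for tr in range(s) for r in range(f)]
        # Average-pool back to the base feature-map size (assumed alignment step).
        per_scale.append(block_reduce(merged, s, mean))
    # Concatenate the scales along a trailing channel axis.
    n = len(per_scale[0])
    return [[[fmap[i][j] for fmap in per_scale] for j in range(n)]
            for i in range(n)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
features = multiscale_features(image, toy_model, scales=[1, 2])
```

Here `features` is a 2 x 2 grid with two channels per position, one per scale. The spatial size matches what the toy model produces at the base scale, while the channel count grows with the number of scales.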

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/bfshi/scaling_on_scales.git
  • Requires PyTorch.
  • Supports non-square images (experimental branch dev_any_shape).
  • Official integration and checkpoints available for LLaVA and NVIDIA VILA.
  • Paper (arXiv): https://arxiv.org/abs/2403.13043

Highlighted Details

  • Enables multi-scale feature extraction with a single line of code.
  • Integrated into LLaVA and NVIDIA VILA, with performance benchmarks provided.
  • Supports dynamic aspect ratio processing via Dynamic-S² in NVILA.
  • Offers options for splitting large images to manage memory usage.
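To make the memory point in the last bullet concrete, the following sketch estimates how many sub-images each scale produces when a cap is placed on the sub-image size. It is a back-of-the-envelope calculation under assumptions: `plan_splits` and its `max_tile_size` parameter are hypothetical names for this example and are not taken from the library, and square inputs are assumed.

```python
import math
from typing import Dict, List


def plan_splits(base_size: int, scales: List[int],
                max_tile_size: int) -> List[Dict[str, int]]:
    """For each scale, count the tiles needed to stay within max_tile_size."""
    plan = []
    for s in scales:
        image_size = base_size * s  # side length after resizing
        per_side = math.ceil(image_size / max_tile_size)
        plan.append({
            "image_size": image_size,
            "tiles_per_side": per_side,
            "num_tiles": per_side ** 2,
            "tile_size": math.ceil(image_size / per_side),
        })
    return plan


# Hypothetical setting: a 224-pixel base model run at 1x, 2x, and 3x.
plan = plan_splits(base_size=224, scales=[1, 2, 3], max_tile_size=224)
total_forward_passes = sum(p["num_tiles"] for p in plan)
```

In this setting the model never receives an input larger than 224 pixels per side. If tiles are processed sequentially, peak activation memory is bounded by the tile size rather than by the largest scale, at the cost of more forward passes.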

Maintenance & Community

  • Accepted to ECCV 2024.
  • Active development with ongoing to-dos for new checkpoints and features.
  • Integrations with major projects like LLaVA and NVIDIA VILA indicate community adoption.

Licensing & Compatibility

  • The README does not explicitly state a license. Public availability on GitHub does not by itself grant reuse rights, so check the repository for a LICENSE file before using the code.
  • Compatible with standard PyTorch vision models.

Limitations & Caveats

  • Support for non-square images is noted as experimental.
  • Training requires specific configuration changes to existing frameworks like LLaVA.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
