Vision models for high-resolution generation/perception tasks
This repository provides EfficientViT, a family of lightweight vision foundation models designed for high-resolution generation and perception tasks. It offers accelerated versions of models like Segment Anything (SAM) and enables efficient high-resolution diffusion models through its Deep Compression Autoencoder (DC-AE). The target audience includes researchers and developers working on efficient computer vision, particularly for deployment on resource-constrained devices or for high-throughput applications.
How It Works
EfficientViT is built around a multi-scale linear attention mechanism, which processes high-resolution images at linear rather than quadratic cost in the number of tokens. The Deep Compression Autoencoder (DC-AE) family offers high spatial compression ratios (up to 128x) while maintaining reconstruction quality, significantly accelerating latent diffusion models. EfficientViT-SAM replaces the heavy image encoder in Segment Anything (SAM) with EfficientViT, achieving substantial speedups (e.g., a 48.9x TensorRT speedup on A100) without accuracy loss.
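To make the linear-cost claim concrete, here is a minimal single-head sketch of ReLU linear attention, the building block that multi-scale linear attention is based on. This is an illustrative NumPy sketch, not the repository's actual implementation; the real module aggregates heads at multiple kernel scales, which is omitted here.

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with a ReLU feature map instead of softmax.

    Q, K: (N, d), V: (N, dv). Cost is O(N * d * dv) because the
    (d, dv) key-value summary is shared across all N queries,
    avoiding the O(N^2) pairwise attention matrix.
    """
    Qf = np.maximum(Q, 0.0)           # phi(Q) = ReLU(Q)
    Kf = np.maximum(K, 0.0)           # phi(K) = ReLU(K)
    KV = Kf.T @ V                     # (d, dv) summary, computed once
    Z = Qf @ Kf.sum(axis=0) + eps     # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]     # (N, dv)

rng = np.random.default_rng(0)
N, d = 4096, 32                       # 4096 tokens ~ a 64x64 feature map
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = relu_linear_attention(Q, K, V)
print(out.shape)                      # (4096, 32)
```

Because `KV` and `Z` are computed once and reused for every query, doubling the image resolution (4x the tokens) only quadruples the cost, instead of the 16x growth softmax attention would incur.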
Quick Start & Requirements
conda create -n efficientvit python=3.10
conda activate efficientvit
pip install -U -r requirements.txt
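As a back-of-the-envelope illustration of why DC-AE's compression ratio matters: the sketch below compares latent token counts for a 1024x1024 image at an 8x spatial compression (a common convention for standard diffusion autoencoders; this baseline figure is an assumption, not from the README) versus DC-AE's 128x.

```python
def latent_tokens(image_size: int, compression: int) -> int:
    """Latent token count for a square image at a given spatial
    compression ratio (assumes a square latent grid, 1 token per cell)."""
    side = image_size // compression
    return side * side

img = 1024
baseline = latent_tokens(img, 8)     # assumed f8 autoencoder
dc_ae = latent_tokens(img, 128)      # DC-AE at its maximum 128x ratio
print(baseline, dc_ae, baseline // dc_ae)  # 16384 64 256
```

With 256x fewer latent tokens, each diffusion step operates on a far smaller sequence, which is the source of the acceleration claimed above.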
Highlighted Details
Maintenance & Community
The project is associated with the MIT Han Lab. Notable integrations include NVIDIA Jetson Generative AI Lab, timm, X-AnyLabeling, and Grounding DINO 1.5 Edge. Papers have been accepted to ICLR 2025, CVPR 2024, and ICCV 2023.
Licensing & Compatibility
The README does not explicitly state a license, so users should check the repository's LICENSE file before commercial or production use. The project is open-source and has been integrated into various third-party projects.
Limitations & Caveats
Reported speedups (e.g., the 48.9x TensorRT figure) depend on specific hardware and software configurations and may not transfer to other setups. The project is under active development, so model checkpoints and APIs may change between releases.