dinov3-finetune by RobvanGastel

Finetuning self-supervised vision encoders for segmentation

Created 1 year ago
309 stars

Top 86.9% on SourcePulse

Project Summary

Summary

This repository enables efficient finetuning of the DINOv2 and DINOv3 self-supervised vision encoders for tasks like image segmentation. It targets researchers and engineers who want to adapt powerful pre-trained models with minimal computational overhead. Using Low-Rank Adaptation (LoRA), it supports task-specific finetuning with significantly fewer trainable parameters, preserving the original encoder weights and reducing resource demands.

How It Works

The project builds on pre-trained DINOv2 or DINOv3 encoders, known for their robust representations of natural images. Finetuning uses Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices into the transformer layers while the pre-trained weights stay frozen, keeping the trainable parameter count and memory footprint low. A lightweight 1x1 convolution or Feature Pyramid Network (FPN) decoder is then trained on top of the adapted encoder for segmentation.
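
As a rough illustration of the LoRA mechanism, here is a minimal PyTorch sketch (not the repository's actual implementation; the rank and alpha values are illustrative assumptions):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        # Only the low-rank factors A and B receive gradients
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus a low-rank correction:
        # y = x @ W.T + scale * (x @ A.T @ B.T)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Because only A and B are trained, the trainable parameter count scales with the rank rather than with the full weight matrices, which is what keeps the finetuning cheap.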

Quick Start & Requirements

Installation requires a Python 3.11 Conda environment:

conda create --name dino python=3.11
conda activate dino
pip install -e .

Advanced visualization with FeatUp additionally requires the CUDA toolkit development packages (cudatoolkit-dev) and cuDNN, along with environment variable configuration (CUDA_HOME, LD_LIBRARY_PATH).
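
A minimal sketch of that environment setup, assuming the toolkit lives in the active conda environment (the paths are assumptions; exporting the same variables in your shell works equally well):

import os

# Point CUDA_HOME at the toolkit install; with cudatoolkit-dev this is
# typically the conda environment prefix (an assumption; adjust as needed).
prefix = os.environ.get("CONDA_PREFIX", "/opt/conda/envs/dino")
os.environ.setdefault("CUDA_HOME", prefix)
# Make the toolkit's shared libraries resolvable when extensions load.
os.environ["LD_LIBRARY_PATH"] = prefix + "/lib:" + os.environ.get("LD_LIBRARY_PATH", "")

An example finetuning command for VOC is: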

python main.py --exp_name base_voc --dataset voc --size base --dino_type dinov3 --img_dim 308 308 --epochs 50 --use_fpn

Walkthroughs are available in Explanation.ipynb and Embedding_visualization.ipynb.

Highlighted Details

  • DINOv3 supports high-resolution video/image processing without external tools like FeatUp.
  • Achieved ~76.4% validation mIoU on Pascal VOC with DINOv3 ViT-L/16, LoRA, and a 1x1 decoder.
  • Robustness to common corruptions: ~73.8% mIoU on Pascal VOC-C (severity 5).
  • On ADE20k, performance reaches ~63.9% mIoU (DINOv3, LoRA, 1x1 decoder), dropping to ~60.7% on ADE20k-C.
  • Supports finetuning with or without LoRA and offers a choice of 1x1 convolution or FPN decoders; a minimal decoder sketch follows this list.
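
For intuition, a 1x1-convolution decoder over ViT patch tokens can be as small as the following sketch (the embedding dimension, class count, and forward signature are illustrative assumptions, not the repository's API):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv1x1Decoder(nn.Module):
    """Per-patch linear classifier over a grid of patch embeddings."""

    def __init__(self, embed_dim: int = 1024, num_classes: int = 21):
        super().__init__()
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, tokens: torch.Tensor, grid: tuple[int, int],
                out_hw: tuple[int, int]) -> torch.Tensor:
        # tokens: (B, N, C) patch embeddings -> (B, C, H, W) feature grid
        b, n, c = tokens.shape
        h, w = grid
        feats = tokens.transpose(1, 2).reshape(b, c, h, w)
        logits = self.head(feats)  # per-patch class logits
        # bilinear upsampling restores the full input resolution
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)

A 1x1 convolution classifies each patch embedding independently, so the decoder adds almost no parameters on top of the encoder; the FPN option trades that simplicity for multi-scale feature fusion.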

Maintenance & Community

Maintained by RobvanGastel, with recent updates in August/September 2025. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

The README does not explicitly state the project's license, so further investigation is needed before commercial use or integration into closed-source projects.

Limitations & Caveats

The FeatUp visualization setup demands a non-trivial CUDA/cuDNN configuration. The absence of a stated software license is a potential adoption blocker. Performance degrades on corrupted variants of the benchmarks (Pascal VOC-C, ADE20k-C), as the mIoU figures above show.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 52 stars in the last 30 days
