dinov3-finetune by RobvanGastel

Finetuning self-supervised vision encoders for segmentation

Created 1 year ago
309 stars

Top 86.9% on SourcePulse

Project Summary

Summary

This repository enables efficient finetuning of the DINOv2 and DINOv3 self-supervised vision encoders for tasks like image segmentation. It targets researchers and engineers who want to adapt powerful pre-trained models with minimal computational overhead. Using Low-Rank Adaptation (LoRA), it supports task-specific finetuning with significantly fewer trainable parameters, preserving the original encoder weights and reducing resource demands.

How It Works

The project builds on pre-trained DINOv2 or DINOv3 encoders, known for their robust representations of natural images. Finetuning uses Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices into the transformer layers while the pre-trained weights stay frozen, keeping the trainable parameter count and memory footprint low. A lightweight 1x1 convolution or Feature Pyramid Network (FPN) decoder is then trained on top of the adapted encoder for segmentation.
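
As a rough illustration of the LoRA mechanism, here is a minimal PyTorch sketch (not the repository's actual implementation; the rank and alpha values are illustrative assumptions):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay frozen
        # Only the low-rank factors A and B receive gradients
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus a low-rank correction:
        # y = x @ W.T + scale * (x @ A.T @ B.T)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

Because only A and B are trained, the trainable parameter count scales with the rank rather than with the full weight matrices, which is what keeps the finetuning cheap.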

Quick Start & Requirements

Installation requires a Python 3.11 Conda environment:

conda create --name dino python=3.11
conda activate dino
pip install -e .

Advanced visualization with FeatUp additionally requires the CUDA toolkit development packages (cudatoolkit-dev) and cuDNN, along with environment variable configuration (CUDA_HOME, LD_LIBRARY_PATH).
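
A minimal sketch of that environment setup, assuming the toolkit lives in the active conda environment (the paths are assumptions; exporting the same variables in your shell works equally well):

import os

# Point CUDA_HOME at the toolkit install; with cudatoolkit-dev this is
# typically the conda environment prefix (an assumption; adjust as needed).
prefix = os.environ.get("CONDA_PREFIX", "/opt/conda/envs/dino")
os.environ.setdefault("CUDA_HOME", prefix)
# Make the toolkit's shared libraries resolvable when extensions load.
os.environ["LD_LIBRARY_PATH"] = prefix + "/lib:" + os.environ.get("LD_LIBRARY_PATH", "")

An example finetuning command for VOC is: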

python main.py --exp_name base_voc --dataset voc --size base --dino_type dinov3 --img_dim 308 308 --epochs 50 --use_fpn

Walkthroughs are available in Explanation.ipynb and Embedding_visualization.ipynb.

Highlighted Details

  • DINOv3 supports high-resolution video/image processing without external tools like FeatUp.
  • Achieved ~76.4% validation mIoU on Pascal VOC with DINOv3 ViT-L/16, LoRA, and a 1x1 decoder.
  • Robustness to common corruptions: ~73.8% mIoU on Pascal VOC-C (severity 5).
  • On ADE20k, performance reaches ~63.9% mIoU (DINOv3, LoRA, 1x1 decoder), dropping to ~60.7% on ADE20k-C.
  • Supports finetuning with or without LoRA and offers a choice of 1x1 convolution or FPN decoders; a minimal decoder sketch follows this list.
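
For intuition, a 1x1-convolution decoder over ViT patch tokens can be as small as the following sketch (the embedding dimension, class count, and forward signature are illustrative assumptions, not the repository's API):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv1x1Decoder(nn.Module):
    """Per-patch linear classifier over a grid of patch embeddings."""

    def __init__(self, embed_dim: int = 1024, num_classes: int = 21):
        super().__init__()
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, tokens: torch.Tensor, grid: tuple[int, int],
                out_hw: tuple[int, int]) -> torch.Tensor:
        # tokens: (B, N, C) patch embeddings -> (B, C, H, W) feature grid
        b, n, c = tokens.shape
        h, w = grid
        feats = tokens.transpose(1, 2).reshape(b, c, h, w)
        logits = self.head(feats)  # per-patch class logits
        # bilinear upsampling restores the full input resolution
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)

A 1x1 convolution classifies each patch embedding independently, so the decoder adds almost no parameters on top of the encoder; the FPN option trades that simplicity for multi-scale feature fusion.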

Maintenance & Community

Maintained by RobvanGastel, with recent updates in August/September 2025. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

The README does not explicitly state the project's license, so further investigation is needed before commercial use or integration into closed-source projects.

Limitations & Caveats

The FeatUp visualization setup demands a non-trivial CUDA/cuDNN configuration. The absence of a stated software license is a potential adoption blocker. Performance degrades on corrupted variants of the benchmarks (Pascal VOC-C, ADE20k-C), as the mIoU figures above show.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1
  • Star History: 52 stars in the last 30 days
