AViD by levy-tech-spark

Framework for fine-tuning vision-language grounding models

Created 6 months ago
600 stars

Top 54.5% on SourcePulse

Project Summary

AViD is a framework for fine-tuning vision-language grounding models, specifically Grounding DINO, on custom datasets. It targets researchers and developers who need to precisely localize image regions from text descriptions, offering parameter-efficient training via LoRA and EMA stabilization to improve task performance while reducing checkpoint storage.

How It Works

AViD builds upon Grounding DINO, enabling fine-tuning for text-driven region grounding in images. It employs LoRA (Low-Rank Adaptation) to train only a small fraction of the model's parameters (around 2%), which sharply reduces the storage footprint of fine-tuned checkpoints while maintaining performance. An Exponential Moving Average (EMA) of the weights stabilizes training and helps preserve pre-trained knowledge during fine-tuning.
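The LoRA idea above can be sketched in a few lines of NumPy. This is a simplified single-layer illustration, not the project's actual implementation; the layer dimensions here are assumptions, and rank 32 matches the repository's stated default:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 768, 768, 32, 32   # rank-32 LoRA (dimensions assumed)

# Frozen pre-trained weight; only the low-rank factors A and B are trained.
W = rng.standard_normal((d_out, d_in)) * 0.02
A = rng.standard_normal((r, d_in)) * 0.01   # small random init
B = np.zeros((d_out, r))                    # zero init: the update starts as a no-op

def lora_forward(x):
    """y = x W^T + (alpha/r) * x (B A)^T: frozen base path plus low-rank correction."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
y = lora_forward(x)
# With B zero-initialized, the adapted layer reproduces the frozen model exactly.
assert np.allclose(y, x @ W.T)

trainable, frozen = A.size + B.size, W.size
print(f"trainable fraction for this layer: {trainable / (trainable + frozen):.1%}")
```

Note that the per-layer trainable fraction (about 7.7% here) is larger than the roughly 2% model-wide figure, because LoRA is typically applied only to a subset of weight matrices such as the attention projections.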

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: Python, PyTorch with CUDA support. CUDA architecture compatibility may need manual configuration via TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1.
  • Demo: python demo/gradio_app.py --share
  • Training: python train.py --config configs/train_config.yaml
  • Evaluation: python test.py --config configs/test_config.yaml
  • Dataset: Sample fashion dataset available via gdown.
  • Links: GitHub Repo
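The evaluation command above reports mAP, Precision, Recall, and F1. As a generic illustration of how such detection metrics are computed (not AViD's actual evaluation code), here is greedy IoU matching at a 0.5 threshold for a single class:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def prf1(preds, gts, thr=0.5):
    """Greedily match predictions to unmatched ground truths by best IoU."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thr:
            tp += 1
            matched.add(best_j)
    fp, fn = len(preds) - tp, len(gts) - tp
    prec = tp / (tp + fp) if preds else 0.0
    rec = tp / (tp + fn) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# One exact hit and one spurious box: precision 0.5, recall 1.0.
print(prf1([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10)]))
```

mAP additionally averages precision over recall levels (and usually over IoU thresholds and classes), but the matching step sketched here is the common core.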

Highlighted Details

  • Fine-tuning Grounding DINO with LoRA (rank-32 by default) and EMA stabilization.
  • Achieves significant mAP improvements: e.g., Shirt mAP from 0.62 to 0.89.
  • Includes a sample dataset and a comprehensive evaluation framework (mAP, Precision, Recall, F1).
  • Offers interactive Gradio demo and customizable YAML configuration for training.
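The EMA stabilization highlighted above keeps a slowly-moving copy of the weights for evaluation. A minimal sketch (the decay value is an assumption, not taken from the project's config):

```python
import numpy as np

def ema_update(ema, current, decay=0.999):
    """Move each EMA parameter a small step toward the current weights."""
    for k in current:
        ema[k] = decay * ema[k] + (1.0 - decay) * current[k]

# Start the EMA from the pre-trained weights, then update after every optimizer step.
weights = {"w": np.array([1.0])}
ema = {k: v.copy() for k, v in weights.items()}

weights["w"][:] = 2.0          # pretend one training step moved the weight
ema_update(ema, weights)
# The EMA copy barely moves: 0.999 * 1.0 + 0.001 * 2.0 = 1.001
assert np.isclose(ema["w"][0], 1.001)
```

Because the EMA copy changes slowly, it retains pre-trained behavior early in fine-tuning while gradually absorbing the adapted weights, which is how it helps guard against catastrophic forgetting; evaluation is typically run against the EMA copy rather than the raw weights.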

Maintenance & Community

Current priorities include preventing catastrophic forgetting, adding auxiliary losses, and supporting quantization, distributed training, and HuggingFace integration. Contribution guidelines are provided.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is under active development: catastrophic-forgetting prevention, auxiliary losses, quantization, distributed training, and HuggingFace integration are planned but not yet implemented. Users may need to configure CUDA architecture compatibility manually (via TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1).

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
