AViD by levy-tech-spark

Framework for fine-tuning vision-language grounding models

Created 6 months ago
600 stars

Top 54.5% on SourcePulse

Project Summary

AViD is a framework for fine-tuning vision-language grounding models, specifically Grounding DINO, on custom datasets. It targets researchers and developers who need to precisely localize image regions from text descriptions, offering parameter-efficient training via LoRA and EMA stabilization to improve task performance while reducing checkpoint storage.

How It Works

AViD builds upon Grounding DINO, enabling fine-tuning for text-driven region grounding in images. It employs LoRA (Low-Rank Adaptation) to train only a small fraction of the model's parameters (around 2%), which sharply reduces the storage footprint of fine-tuned checkpoints while maintaining performance. An Exponential Moving Average (EMA) of the weights stabilizes training and helps preserve pre-trained knowledge during fine-tuning.
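The LoRA idea above can be sketched in a few lines of NumPy. This is a simplified single-layer illustration, not the project's actual implementation; the layer dimensions here are assumptions, and rank 32 matches the repository's stated default:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 768, 768, 32, 32   # rank-32 LoRA (dimensions assumed)

# Frozen pre-trained weight; only the low-rank factors A and B are trained.
W = rng.standard_normal((d_out, d_in)) * 0.02
A = rng.standard_normal((r, d_in)) * 0.01   # small random init
B = np.zeros((d_out, r))                    # zero init: the update starts as a no-op

def lora_forward(x):
    """y = x W^T + (alpha/r) * x (B A)^T: frozen base path plus low-rank correction."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((4, d_in))
y = lora_forward(x)
# With B zero-initialized, the adapted layer reproduces the frozen model exactly.
assert np.allclose(y, x @ W.T)

trainable, frozen = A.size + B.size, W.size
print(f"trainable fraction for this layer: {trainable / (trainable + frozen):.1%}")
```

Note that the per-layer trainable fraction (about 7.7% here) is larger than the roughly 2% model-wide figure, because LoRA is typically applied only to a subset of weight matrices such as the attention projections.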

Quick Start & Requirements

  • Install: pip install -r requirements.txt followed by pip install -e .
  • Prerequisites: Python, PyTorch with CUDA support. CUDA architecture compatibility may need manual configuration via TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1.
  • Demo: python demo/gradio_app.py --share
  • Training: python train.py --config configs/train_config.yaml
  • Evaluation: python test.py --config configs/test_config.yaml
  • Dataset: Sample fashion dataset available via gdown.
  • Links: GitHub Repo
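The evaluation command above reports mAP, Precision, Recall, and F1. As a generic illustration of how such detection metrics are computed (not AViD's actual evaluation code), here is greedy IoU matching at a 0.5 threshold for a single class:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def prf1(preds, gts, thr=0.5):
    """Greedily match predictions to unmatched ground truths by best IoU."""
    matched, tp = set(), 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thr:
            tp += 1
            matched.add(best_j)
    fp, fn = len(preds) - tp, len(gts) - tp
    prec = tp / (tp + fp) if preds else 0.0
    rec = tp / (tp + fn) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# One exact hit and one spurious box: precision 0.5, recall 1.0.
print(prf1([(0, 0, 10, 10), (20, 20, 30, 30)], [(0, 0, 10, 10)]))
```

mAP additionally averages precision over recall levels (and usually over IoU thresholds and classes), but the matching step sketched here is the common core.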

Highlighted Details

  • Fine-tuning Grounding DINO with LoRA (rank-32 by default) and EMA stabilization.
  • Achieves significant mAP improvements: e.g., Shirt mAP from 0.62 to 0.89.
  • Includes a sample dataset and a comprehensive evaluation framework (mAP, Precision, Recall, F1).
  • Offers interactive Gradio demo and customizable YAML configuration for training.
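The EMA stabilization highlighted above keeps a slowly-moving copy of the weights for evaluation. A minimal sketch (the decay value is an assumption, not taken from the project's config):

```python
import numpy as np

def ema_update(ema, current, decay=0.999):
    """Move each EMA parameter a small step toward the current weights."""
    for k in current:
        ema[k] = decay * ema[k] + (1.0 - decay) * current[k]

# Start the EMA from the pre-trained weights, then update after every optimizer step.
weights = {"w": np.array([1.0])}
ema = {k: v.copy() for k, v in weights.items()}

weights["w"][:] = 2.0          # pretend one training step moved the weight
ema_update(ema, weights)
# The EMA copy barely moves: 0.999 * 1.0 + 0.001 * 2.0 = 1.001
assert np.isclose(ema["w"][0], 1.001)
```

Because the EMA copy changes slowly, it retains pre-trained behavior early in fine-tuning while gradually absorbing the adapted weights, which is how it helps guard against catastrophic forgetting; evaluation is typically run against the EMA copy rather than the raw weights.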

Maintenance & Community

Current priorities include preventing catastrophic forgetting, adding auxiliary losses, and supporting quantization, distributed training, and HuggingFace integration. Contribution guidelines are provided.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The project is under active development: catastrophic-forgetting prevention, auxiliary losses, quantization, distributed training, and HuggingFace integration are planned but not yet implemented. Users may need to configure CUDA architecture compatibility manually (via TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1).

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
