Framework for fine-tuning vision-language grounding models
Top 54.5% on SourcePulse
AViD is a framework for fine-tuning vision-language grounding models, specifically Grounding DINO, on custom datasets. It targets researchers and developers needing precise image-to-text region localization, offering parameter-efficient training via LoRA and EMA stabilization for improved performance and reduced storage.
How It Works
AViD builds upon Grounding DINO, enabling fine-tuning for image-to-text grounding. It employs LoRA (Low-Rank Adaptation) to train a small fraction of model parameters (around 2%), significantly reducing storage needs for fine-tuned models while maintaining performance. Exponential Moving Average (EMA) stabilization is used to preserve pre-trained knowledge during the fine-tuning process.
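The recipe above can be sketched roughly in Python. This is a hedged illustration, not AViD's actual code: the use of the peft library, the target_modules names, and the EmaWeights helper are all assumptions made for the example.

import copy
import torch
from peft import LoraConfig, get_peft_model

def add_lora_adapters(model, rank=8):
    # Wrap the base grounding model so that only low-rank adapter weights
    # (a small fraction of all parameters) receive gradients.
    config = LoraConfig(
        r=rank,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # hypothetical attention projections
    )
    lora_model = get_peft_model(model, config)
    lora_model.print_trainable_parameters()  # reports the trainable fraction
    return lora_model

class EmaWeights:
    # Keeps an exponential moving average of the weights so the fine-tuned
    # model drifts only slowly away from its pre-trained initialization.
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

In a loop of this shape, the EMA copy would be updated after each optimizer step, and only the small adapter weights need to be stored per fine-tuned checkpoint, which is where the storage savings come from.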
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt followed by pip install -e . (building the CUDA extensions may require setting the TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1 environment variables). Launch the demo with python demo/gradio_app.py --share, train with python train.py --config configs/train_config.yaml, and evaluate with python test.py --config configs/test_config.yaml. The gdown utility is also required.
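As a rough illustration of what the demo exposes (the actual demo/gradio_app.py may differ), a minimal Gradio app of this shape takes an image plus a text prompt and returns the image with grounded boxes; the predict_boxes stub below is a placeholder, not the repository's function.

import gradio as gr

def predict_boxes(image, text_prompt):
    # Placeholder: a real app would run the fine-tuned Grounding DINO model
    # on the image, localize the phrases in text_prompt, and draw the boxes.
    return image

demo = gr.Interface(
    fn=predict_boxes,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Text prompt")],
    outputs=gr.Image(type="pil"),
    title="Vision-language grounding demo",
)

if __name__ == "__main__":
    demo.launch(share=True)  # the --share flag corresponds to share=True here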
Highlighted Details
Maintenance & Community
Current priorities include techniques to prevent catastrophic forgetting, auxiliary losses, quantization support, distributed training, and HuggingFace integration. Contribution guidelines are provided.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
The project is actively under development, and the roadmap items above, including catastrophic-forgetting prevention, auxiliary losses, quantization, distributed training, and HuggingFace integration, are not yet implemented. Users may need to configure CUDA architecture compatibility manually (see TORCH_CUDA_ARCH_LIST and FORCE_CUDA=1 above).
Last updated: 5 months ago. Activity status: Inactive.