Ferret: MLLM for referring and grounding anything, anywhere
Ferret is an end-to-end multimodal large language model (MLLM) designed for fine-grained, open-vocabulary referring and grounding tasks. It targets researchers and developers working with multimodal AI, and can both interpret references to specific image regions and localize objects anywhere in an image from natural-language descriptions.
How It Works
Ferret employs a hybrid region representation, which combines discrete coordinates with continuous visual features, together with a spatial-aware visual sampler that handles region shapes beyond plain boxes. This enables the model to understand referring expressions at a granular level and to ground objects anywhere in an image. The model builds on the LLaVA architecture and uses Vicuna as its base LLM.
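The repo does not expose the sampler as a standalone API; the following is a minimal PyTorch sketch of the underlying idea only, with every name invented here for illustration. It summarizes a free-form region into a single token by sampling continuous features at points inside the region mask and concatenating their normalized coordinates, echoing the "discrete coordinates + continuous features" hybrid representation:

```python
# Minimal sketch of the idea, not Ferret's actual implementation: build one
# "region token" by sampling continuous visual features at points inside a
# free-form region mask and concatenating their normalized coordinates.
import torch
import torch.nn.functional as F

def region_token(feat: torch.Tensor, mask: torch.Tensor, k: int = 64) -> torch.Tensor:
    """feat: [C, H, W] visual feature map; mask: [H, W] boolean region mask."""
    ys, xs = torch.nonzero(mask, as_tuple=True)      # pixel coords inside the region
    idx = torch.randint(0, ys.numel(), (k,))         # sample k points (with replacement)
    h, w = mask.shape
    # Normalize sampled (x, y) to [-1, 1], the convention grid_sample expects.
    grid = torch.stack([xs[idx] / (w - 1), ys[idx] / (h - 1)], dim=-1) * 2 - 1
    pts = F.grid_sample(feat[None], grid[None, None], align_corners=True)[0, :, 0]
    hybrid = torch.cat([pts, grid.t()], dim=0)       # [C + 2, k]: features + coords
    return hybrid.mean(dim=1)                        # pool into a single region token

# Example: a 256-channel feature map with a 20x20 square region.
feat = torch.randn(256, 24, 24)
mask = torch.zeros(24, 24, dtype=torch.bool)
mask[2:22, 2:22] = True
print(region_token(feat, mask).shape)  # torch.Size([258])
```

Ferret's actual sampler is more involved (it fuses neighboring point features across stages), but the sketch shows why arbitrary shapes are handled uniformly: any region reduces to a set of sampled points.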
Quick Start & Requirements
Install with `pip install -e .` inside a Python 3.10 conda environment. Additional packages such as `pycocotools` and `protobuf==3.20.0` are required; training additionally needs `ninja` and `flash-attn`.
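As a quick post-install sanity check (an illustrative snippet, not from the Ferret README), you can confirm that the pinned dependencies resolved:

```python
# Illustrative sanity check, not part of the Ferret repo: verify that the
# pinned protobuf version and pycocotools import cleanly after `pip install -e .`.
import pycocotools  # noqa: F401  (required extra package)
from google import protobuf

assert protobuf.__version__ == "3.20.0", f"unexpected protobuf {protobuf.__version__}"
print("protobuf", protobuf.__version__, "- environment looks OK")
```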
Maintenance & Community
The project is from Apple, and the README acknowledges the LLaVA and Vicuna projects as key foundations. Further community-interaction details are not specified in the README; the repository was last updated about nine months ago and appears inactive.
Licensing & Compatibility
The data, code, and model weights are licensed for research use only. Usage is further restricted by the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY NC 4.0, prohibiting commercial use, and the released model-weight deltas (applied on top of the base LLM weights) are likewise CC BY NC.
Limitations & Caveats
The project explicitly states that all components are intended and licensed for research purposes only, with restrictions on commercial use due to underlying model licenses and the CC BY NC 4.0 dataset license.