ml-ferret  by apple

MLLM for referring and grounding anything, anywhere

created 1 year ago
8,646 stars

Top 6.0% on sourcepulse

Project Summary

Ferret is an end-to-end multimodal large language model (MLLM) designed for fine-grained, open-vocabulary referring and grounding. Referring means understanding a region that the user points out in an image; grounding means localizing the objects that a piece of text describes. The project targets researchers and developers working with multimodal AI who need precise, language-driven object identification within images.

How It Works

Ferret combines a hybrid region representation, which pairs discrete coordinates with continuous region features, with a spatial-aware visual sampler that extracts those features from regions of varying shape (points, boxes, or free-form shapes). Together these let the model interpret referring expressions at a fine granularity and ground objects anywhere in an image. The model builds on the LLaVA architecture and uses Vicuna as its base LLM.
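To make the idea of a hybrid region representation concrete, the following is a toy sketch rather than Ferret's actual implementation. It pairs quantized box coordinates (the discrete part) with a feature vector averaged over a region mask (the continuous part). The names HybridRegion, quantize_box, pool_region_feature, and encode_region are hypothetical, the 1000-bin quantization is an assumed value, and plain mean pooling stands in for the learned spatial-aware visual sampler.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class HybridRegion:
    """Hypothetical container: discrete coordinates plus a continuous feature."""
    box_tokens: Tuple[int, int, int, int]
    feature: List[float]


def quantize_box(box: Sequence[float], width: int, height: int,
                 bins: int = 1000) -> Tuple[int, int, int, int]:
    """Map pixel corners (x1, y1, x2, y2) to integer bins in [0, bins - 1].

    The bin count is an assumption made for this sketch.
    """
    def q(value: float, size: int) -> int:
        return min(bins - 1, max(0, int(value / size * bins)))

    x1, y1, x2, y2 = box
    return (q(x1, width), q(y1, height), q(x2, width), q(y2, height))


def pool_region_feature(feature_map, mask) -> List[float]:
    """Average the feature vectors at grid cells where the mask is 1.

    Mean pooling is a stand-in for a learned sampler, not Ferret's method.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(feature_map, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        raise ValueError("mask selects no cells")
    return [t / count for t in total]


def encode_region(box, mask, feature_map, width, height) -> HybridRegion:
    """Build the hybrid representation for one referred region."""
    return HybridRegion(quantize_box(box, width, height),
                        pool_region_feature(feature_map, mask))


# Toy input: a 2x2 feature grid with 2 channels; the mask selects the top row.
feature_map = [[[1.0, 0.0], [3.0, 2.0]],
               [[5.0, 4.0], [7.0, 6.0]]]
mask = [[1, 1],
        [0, 0]]
region = encode_region((0, 0, 100, 50), mask, feature_map, width=100, height=100)
```

In this sketch the coordinates could be rendered as text tokens while the pooled vector would be injected as an embedding, which is the general shape of a discrete-plus-continuous representation. How Ferret itself tokenizes coordinates and samples features is described in the paper.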

Quick Start & Requirements

  • Install: Clone the repository and install via pip install -e . within a Python 3.10 conda environment. Additional packages like pycocotools and protobuf==3.20.0 are required. For training, ninja and flash-attn are needed.
  • Prerequisites: Vicuna v1.3 weights and LLaVA's first-stage projector weights are necessary. Ferret's checkpoints are distributed as delta weights, so the delta files must be downloaded and applied to the base Vicuna weights.
  • Demo: Requires launching a controller, Gradio web server, and model worker.
  • Training: Requires 8x A100 GPUs with 80GB memory.
  • Links: Paper, Ferret-UI, Ferret-Bench
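The install requirements above can be checked before running anything heavier. The snippet below is a hypothetical helper, not part of the repository: it verifies that the interpreter is Python 3.10 and that the listed packages are installed, using only the standard library. The distribution names passed to importlib.metadata (for example "flash-attn") are assumptions about how the packages are registered.

```python
import sys
from importlib import metadata

# Packages named in the install notes; None means "any version".
REQUIRED = {"pycocotools": None, "protobuf": "3.20.0"}
# Needed only for training, per the install notes.
TRAINING_ONLY = {"ninja": None, "flash-attn": None}


def check_python(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.10."""
    return tuple(version_info[:2]) == (3, 10)


def check_packages(requirements, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, pinned in requirements.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if pinned is not None and installed != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems


if __name__ == "__main__":
    if not check_python():
        print(f"Python 3.10 expected, found {sys.version.split()[0]}")
    for problem in check_packages({**REQUIRED, **TRAINING_ONLY}):
        print(problem)
```

Running it inside the conda environment prints one line per mismatch and nothing when the environment matches. The get_version parameter exists only so the check can be exercised without the packages installed.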

Highlighted Details

  • Introduces the GRIT dataset (~1.1M samples) for instruction tuning.
  • Provides Ferret-Bench, a multimodal evaluation benchmark.
  • Ferret-v2 accepted to COLM 2024; Ferret accepted to ICLR 2024 as a Spotlight.
  • Offers pre-trained checkpoints for 7B and 13B parameter models.

Maintenance & Community

The project is from Apple. Key contributions are acknowledged from LLaVA and Vicuna projects. Further community interaction details are not specified in the README.

Licensing & Compatibility

The data, code, and model weights are licensed for research use only. Usage is further restricted by the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0, which prohibits commercial use, and the model weight differentials are also licensed under CC BY-NC.

Limitations & Caveats

The project explicitly states that all components are intended and licensed for research purposes only, with restrictions on commercial use due to underlying model licenses and the CC BY NC 4.0 dataset license.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 58 stars in the last 90 days
