ml-fastvlm by Apple

Vision-language model research paper

created 3 months ago
5,207 stars

Top 9.8% on sourcepulse

Project Summary

FastVLM offers an efficient vision encoding solution for vision-language models (VLMs), targeting researchers and developers seeking faster processing of high-resolution images. It significantly reduces token count and encoding time, outperforming existing models in speed and efficiency.

How It Works

FastVLM introduces FastViTHD, a novel hybrid vision encoder. This architecture is designed to output fewer tokens from high-resolution images, leading to substantial reductions in encoding time. The hybrid approach balances efficiency with performance, enabling faster Time-to-First-Token (TTFT) and smaller model footprints compared to traditional methods.

Quick Start & Requirements

  • Install: create an environment with conda create -n fastvlm python=3.10, activate it, then run pip install -e . inside the repository.
  • Download checkpoints: bash get_models.sh
  • Inference: python predict.py --model-path /path/to/checkpoint-dir --image-file /path/to/image.png --prompt "Describe the image."
  • Apple Silicon export and inference instructions are available in the model_export subfolder.
  • Inference on Apple devices (iPhone, iPad, Mac) is detailed in the app subfolder.
  • Official paper: [CVPR 2025]
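
Putting the steps above together, a full quick-start session might look like the following sketch. It assumes the ml-fastvlm repository has already been cloned and that conda is available; the checkpoint and image paths are placeholders to replace with your own.

```shell
# Create and activate an isolated environment (Python 3.10, per the README)
conda create -n fastvlm python=3.10 -y
conda activate fastvlm

# Install the repo in editable mode from inside the cloned ml-fastvlm directory
pip install -e .

# Download the pretrained checkpoints
bash get_models.sh

# Run inference; substitute your checkpoint directory and image path
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
```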

Highlighted Details

  • Its smallest variant achieves an 85x faster TTFT and a 3.4x smaller vision encoder than LLaVA-OneVision-0.5B.
  • Larger variants outperform Cambrian-1-8B with a 7.9x faster TTFT.
  • Offers pre-trained checkpoints for 0.5B, 1.5B, and 7B parameter models across two training stages.
  • Includes an iOS demo app showcasing mobile performance.

Maintenance & Community

  • Codebase built using multiple open-source contributions.
  • Citation details provided for the CVPR 2025 paper.

Licensing & Compatibility

  • Code license and model license details are available in separate LICENSE files. Specific license types are not detailed in the README.

Limitations & Caveats

The project relies on the LLaVA codebase for training and finetuning, so users must follow LLaVA's instructions for those workflows. Compatibility and performance on hardware other than Apple Silicon are not extensively documented.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 2
  • Star History: 5,247 stars in the last 90 days
