FastVLM: vision-language model research paper
FastVLM provides efficient vision encoding for vision-language models (VLMs), targeting researchers and developers who need faster processing of high-resolution images. By emitting far fewer visual tokens, it cuts encoding time substantially and outperforms comparable models in both speed and efficiency.
How It Works
FastVLM introduces FastViTHD, a novel hybrid vision encoder. This architecture is designed to output fewer tokens from high-resolution images, leading to substantial reductions in encoding time. The hybrid approach balances efficiency with performance, enabling faster Time-to-First-Token (TTFT) and smaller model footprints compared to traditional methods.
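To make the token-count effect concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the FastViTHD implementation: the 16-pixel patch size matches a standard ViT, while the 64-pixel effective stride for the hybrid encoder is an assumed illustrative value.

def vit_token_count(image_size: int, patch_size: int = 16) -> int:
    # Standard ViT patchification: one token per non-overlapping patch.
    return (image_size // patch_size) ** 2

def hybrid_token_count(image_size: int, total_stride: int = 64) -> int:
    # Hybrid encoder with extra convolutional downsampling stages;
    # total_stride = 64 is hypothetical, for illustration only.
    return (image_size // total_stride) ** 2

for size in (512, 768, 1024):
    print(f"{size}px: ViT={vit_token_count(size)}, hybrid={hybrid_token_count(size)}")

At 1024x1024 this works out to 4096 tokens versus 256: the LLM prefill stage sees 16x fewer visual tokens, which is what drives the faster TTFT.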
Quick Start & Requirements
Install with pip install -e . inside a conda create -n fastvlm python=3.10 environment, then download the pretrained checkpoints with bash get_models.sh. Run inference on an image with:
python predict.py --model-path /path/to/checkpoint-dir --image-file /path/to/image.png --prompt "Describe the image."
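For scripted use, the CLI above can be wrapped from Python. This is a hedged sketch assuming predict.py behaves as shown; describe_image is a hypothetical helper and the paths are placeholders.

import subprocess

def describe_image(checkpoint_dir: str, image_path: str,
                   prompt: str = "Describe the image.") -> str:
    # Invoke the repo's predict.py CLI from the repository root.
    result = subprocess.run(
        ["python", "predict.py",
         "--model-path", checkpoint_dir,
         "--image-file", image_path,
         "--prompt", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(describe_image("/path/to/checkpoint-dir", "/path/to/image.png"))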
Instructions for exporting models are provided in the model_export subfolder, and a demo app is included in the app subfolder.
Highlighted Details
Maintenance & Community
Last update: 2 months ago; activity status: inactive.
Licensing & Compatibility
Limitations & Caveats
Training and finetuning rely on the LLaVA codebase, so users must follow LLaVA's instructions for those steps. Compatibility and performance on hardware other than Apple Silicon are not documented in detail.