Vision-language model for mobile devices
MobileVLM is a family of open-source vision-language models (VLMs) designed for efficient inference on mobile devices. It offers strong performance on various benchmarks, rivaling larger models, and provides fast inference speeds on mobile hardware. The project targets researchers and developers building multimodal applications for edge devices.
How It Works
MobileVLM utilizes a lightweight downsample projector (LDP) to efficiently fuse visual features from a frozen vision encoder with a language model (MobileLLaMA). This approach minimizes computational overhead, enabling faster inference and reduced memory footprint compared to standard VLMs. MobileVLM V2 further refines this by introducing LDPv2 and an improved training scheme tailored for mobile VLMs.
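The core idea of LDP can be pictured with a short sketch: a per-token projection into the language model's embedding space, followed by a stride-2 depthwise convolution that cuts the visual token count by 4x at negligible parameter cost. The module below is an illustrative simplification, not the repository's implementation; the dimensions (1024-d vision features, 2048-d LLM embeddings, a 24x24 patch grid) are assumptions for the example.

```python
import torch
import torch.nn as nn

class LDPSketch(nn.Module):
    """Illustrative lightweight downsample projector (simplified, not the MobileVLM code).

    Maps ViT patch features [B, N, vis_dim] to LLM embeddings [B, N/4, llm_dim]
    using pointwise projections plus a stride-2 depthwise conv for token reduction.
    """

    def __init__(self, vis_dim=1024, llm_dim=2048):
        super().__init__()
        # Per-token projection into the language model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Depthwise conv with stride 2 halves each spatial side (4x fewer tokens).
        self.downsample = nn.Conv2d(
            llm_dim, llm_dim, kernel_size=3, stride=2, padding=1, groups=llm_dim
        )

    def forward(self, x):                      # x: [B, N, vis_dim], N = H*W
        b, n, _ = x.shape
        h = w = int(n ** 0.5)                  # assume a square patch grid
        x = self.proj(x)                       # [B, N, llm_dim]
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.downsample(x)                 # [B, llm_dim, H/2, W/2]
        return x.flatten(2).transpose(1, 2)    # [B, N/4, llm_dim]

if __name__ == "__main__":
    feats = torch.randn(1, 576, 1024)          # e.g. 24x24 CLIP patch tokens
    print(LDPSketch()(feats).shape)            # torch.Size([1, 144, 2048])
```

Reducing the token count before the language model is what drives the latency and memory savings, since LLM cost scales with the number of visual tokens it must attend over.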
Quick Start & Requirements
Install dependencies within a conda environment (Python 3.10 recommended):
pip install -r requirements.txt
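After installation, a first single-image inference run looks roughly like the sketch below. The entry point (scripts.inference.inference_once), the argument names, and the mtgv/MobileVLM_V2-1.7B checkpoint ID follow the example published in the repository, but should be verified against the current code before use.

```python
from scripts.inference import inference_once  # helper shipped with the repo

# Checkpoint, image, and prompt are illustrative; adjust to your setup.
args = type("Args", (), {
    "model_path": "mtgv/MobileVLM_V2-1.7B",   # or a local checkpoint directory
    "image_file": "assets/samples/demo.jpg",
    "prompt": "Who is the author of this book?",
    "conv_mode": "v1",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,
    "load_4bit": False,
})()

inference_once(args)
```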
Maintenance & Community
The project has seen active development, with releases of MobileVLM V2, MobileLLaMA pre-training code, and supervised fine-tuning (SFT) code. Community engagement is encouraged, and deployment on mobile devices remains a key focus.
Licensing & Compatibility
The code is licensed under Apache 2.0. However, the project notes that it utilizes datasets and checkpoints subject to their original licenses, requiring users to comply with all terms.
Limitations & Caveats
The project relies on external datasets which require manual downloading and organization, potentially involving significant storage and setup time. While the code is Apache 2.0 licensed, the use of underlying datasets and checkpoints may introduce additional licensing considerations for commercial applications.