MobileVLM by Meituan-AutoML

Vision language model for mobile devices

created 1 year ago
1,246 stars

Top 32.3% on sourcepulse

Project Summary

MobileVLM is a family of open-source vision-language models (VLMs) designed for efficient inference on mobile devices. It offers strong performance on various benchmarks, rivaling larger models, and provides fast inference speeds on mobile hardware. The project targets researchers and developers building multimodal applications for edge devices.

How It Works

MobileVLM utilizes a lightweight downsample projector (LDP) to efficiently fuse visual features from a frozen vision encoder with a language model (MobileLLaMA). This approach minimizes computational overhead, enabling faster inference and reduced memory footprint compared to standard VLMs. MobileVLM V2 further refines this by introducing LDPv2 and an improved training scheme tailored for mobile VLMs.
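
To make the projector idea concrete, here is a minimal PyTorch sketch of an LDP-style module: pointwise layers project frozen vision-encoder features into the language model's embedding space, and a stride-2 depthwise convolution cuts the visual token count 4x. This is an illustrative sketch, not the repository's exact LDP; the class name, layer choices, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LightweightDownsampleProjector(nn.Module):
    """Sketch of an LDP-style projector (illustrative, not the repo's exact LDP).

    Pointwise (linear) layers map vision features to the LLM embedding
    dimension; a stride-2 depthwise conv halves each spatial axis, so
    e.g. 24x24 = 576 patch tokens become 12x12 = 144 tokens.
    """

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Depthwise conv: cheap spatial mixing plus 2x downsampling per axis.
        self.down = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2,
                              padding=1, groups=llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, vis_dim), where N is a square number of patch tokens.
        b, n, _ = x.shape
        h = w = int(n ** 0.5)
        x = self.proj(x)                            # (B, N, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)  # (B, llm_dim, H, W)
        x = self.down(x)                            # (B, llm_dim, H/2, W/2)
        return x.flatten(2).transpose(1, 2)         # (B, N/4, llm_dim)

# 576 ViT patch tokens -> 144 tokens fed to the language model.
tokens = LightweightDownsampleProjector(1024, 2048)(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 144, 2048])
```

Downsampling before the language model is where the savings come from: the LLM's attention cost grows with sequence length, so feeding 144 visual tokens instead of 576 reduces both latency and memory.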

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt within a conda environment (Python 3.10 recommended).
  • Prerequisites: Access to pre-trained MobileLLaMA checkpoints (available on HuggingFace; see the loading sketch after this list) and, for training, large datasets that must be downloaded separately.
  • Resources: Training MobileVLM V2 1.7B/3B requires approximately 38-52GB of GPU memory per model, with training times ranging from 3-12 hours on 8x A100 GPUs.
  • Links: HuggingFace, Code, llama.cpp Support
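
Since MobileLLaMA follows the LLaMA architecture, the checkpoints should load through the standard transformers API. A minimal sketch, assuming that API; the model ID below is illustrative, so check the Meituan-AutoML HuggingFace page for the exact checkpoint names:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint name; verify against the project's HuggingFace org.
model_id = "mtgv/MobileLLaMA-1.4B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What can a vision-language model do?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For full multimodal inference (image plus text), use the scripts provided in the repository rather than the plain text-only path shown here.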

Highlighted Details

  • MobileVLM V2 1.7B performs competitively against 3B-scale models, and MobileVLM V2 3B outperforms many VLMs at the 7B+ scale.
  • Achieves state-of-the-art inference speeds of 21.5 tokens/sec on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/sec on an NVIDIA Jetson Orin GPU.
  • Supports training from scratch or fine-tuning with provided scripts and datasets.
  • Models are available on HuggingFace, with official support for deployment via llama.cpp.

Maintenance & Community

The project has seen active development, with releases of MobileVLM V2, MobileLLaMA pre-training code, and SFT code. Community engagement is encouraged, and deployment on mobile devices remains a key focus.

Licensing & Compatibility

The code is licensed under Apache 2.0. However, the project notes that its datasets and checkpoints remain subject to their original licenses, and users must comply with those terms.

Limitations & Caveats

The project relies on external datasets that require manual downloading and organization, which can involve significant storage and setup time. While the code is Apache 2.0 licensed, the underlying datasets and checkpoints may introduce additional licensing considerations for commercial applications.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 46 stars in the last 90 days
