DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding

Created 9 months ago
5,048 stars

Top 9.8% on SourcePulse

View on GitHub
Project Summary

DeepSeek-VL2 is a series of Mixture-of-Experts (MoE) vision-language models designed for advanced multimodal understanding tasks such as visual question answering, OCR, and document analysis. Targeting researchers and developers, it offers competitive performance with efficient parameter activation and comes in three variants sized by activated parameters: Tiny (1.0B), Small (2.8B), and the base model (4.5B).

How It Works

DeepSeek-VL2 uses a Mixture-of-Experts architecture: for each token, a router activates only a small subset of the expert parameters. This allows a much larger total parameter count while keeping per-token compute low, giving faster inference than dense models of comparable quality. The models accept multimodal inputs, including multiple images, and support object localization via special tokens.
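To make the "activated subset" idea concrete, here is a minimal, hypothetical top-k routing layer in plain PyTorch. It illustrates the general MoE pattern only; the layer name, dimensions, and expert count are made up and are not DeepSeek-VL2's actual implementation.

    import torch
    import torch.nn as nn

    class ToyMoELayer(nn.Module):
        """Toy top-k expert routing: each token uses only top_k of num_experts."""
        def __init__(self, dim=64, num_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
            self.gate = nn.Linear(dim, num_experts)
            self.top_k = top_k

        def forward(self, x):                                  # x: (tokens, dim)
            scores = self.gate(x)                              # (tokens, num_experts)
            weights, expert_idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)                  # per-token mixture weights
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = expert_idx[:, slot] == e            # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = ToyMoELayer()
    print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])

Only two of the eight expert projections run for any given token, which is why the activated parameter counts (1.0B-4.5B) can be far smaller than the models' total parameter counts.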

Quick Start & Requirements

  • Install dependencies: pip install -e .
  • Requires Python >= 3.8.
  • Inference examples suggest 80GB GPU memory for deepseek-vl2-small and larger for deepseek-vl2.
  • Incremental prefilling can reduce memory requirements for deepseek-vl2-small to 40GB.
  • Official demos and inference scripts are available; a condensed single-image inference sketch follows this list.
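The following sketch shows the general shape of single-image inference with the deepseek_vl2 package. The class and helper names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds, model.language.generate) are recalled from the repository's README example and should be verified against the repo; the model id, image path, and prompt are placeholders.

    import torch
    from transformers import AutoModelForCausalLM
    from deepseek_vl2.models import DeepseekVLV2Processor
    from deepseek_vl2.utils.io import load_pil_images

    # Load the processor (tokenizer + image preprocessing) and the model weights.
    model_path = "deepseek-ai/deepseek-vl2-tiny"
    processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = processor.tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model = model.to(torch.bfloat16).cuda().eval()

    # A single-image turn: <image> marks where the image is injected and
    # <|ref|>...<|/ref|> asks the model to ground the referenced object.
    conversation = [
        {
            "role": "<|User|>",
            "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
            "images": ["./images/example.jpg"],   # placeholder path
        },
        {"role": "<|Assistant|>", "content": ""},
    ]

    # Load the images and batch the multimodal inputs.
    pil_images = load_pil_images(conversation)
    inputs = processor(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt="",
    ).to(model.device)

    # Encode images into the embedding stream, then decode with the language backbone.
    inputs_embeds = model.prepare_inputs_embeds(**inputs)
    outputs = model.language.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False))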

Highlighted Details

  • Three model variants by activated parameter count: Tiny (1.0B), Small (2.8B), and base (4.5B).
  • Supports multimodal inputs, including multiple images and object localization with bounding box output.
  • Achieves competitive or state-of-the-art performance with efficient MoE activation.
  • Offers incremental prefilling for reduced memory usage during inference (see the command sketch after this list).
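For incremental prefilling, the repository ships an inference.py script that prefills the prompt in chunks so deepseek-vl2-small fits in roughly 40GB of GPU memory. The invocation below is recalled from the README; treat the exact flag name and value as assumptions to verify against the repo:

    CUDA_VISIBLE_DEVICES=0 python inference.py --model_path "deepseek-ai/deepseek-vl2-small" --chunk_size 512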

Maintenance & Community

The project was released in December 2024. Contact is available via GitHub issues or service@deepseek.com.

Licensing & Compatibility

The code repository is licensed under the MIT License. Model usage is subject to the DeepSeek Model License, which permits commercial use.

Limitations & Caveats

The provided Gradio demo is a basic implementation and may run slowly; production environments should consider an optimized serving solution such as vLLM. The README notes that the larger models require significant GPU memory (80GB+).

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 42 stars in the last 30 days

Explore Similar Projects

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR

Vision-language model for multi-task learning

Created 2 years ago, updated 1 year ago
26k stars
Top 0.0% on SourcePulse