DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding

Created 9 months ago
5,048 stars

Top 9.8% on SourcePulse

View on GitHub
Project Summary

DeepSeek-VL2 is a series of Mixture-of-Experts (MoE) vision-language models designed for advanced multimodal understanding tasks such as visual question answering, OCR, and document analysis. Targeting researchers and developers, it offers competitive performance with efficient parameter activation and comes in three variants sized by activated parameters: Tiny (1.0B), Small (2.8B), and the base model (4.5B).

How It Works

DeepSeek-VL2 uses a Mixture-of-Experts architecture: for each token, a router activates only a small subset of the expert parameters. This allows a much larger total parameter count while keeping per-token compute low, giving faster inference than dense models of comparable quality. The models accept multimodal inputs, including multiple images, and support object localization via special tokens.
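To make the "activated subset" idea concrete, here is a minimal, hypothetical top-k routing layer in plain PyTorch. It illustrates the general MoE pattern only; the layer name, dimensions, and expert count are made up and are not DeepSeek-VL2's actual implementation.

    import torch
    import torch.nn as nn

    class ToyMoELayer(nn.Module):
        """Toy top-k expert routing: each token uses only top_k of num_experts."""
        def __init__(self, dim=64, num_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
            self.gate = nn.Linear(dim, num_experts)
            self.top_k = top_k

        def forward(self, x):                                  # x: (tokens, dim)
            scores = self.gate(x)                              # (tokens, num_experts)
            weights, expert_idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)                  # per-token mixture weights
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = expert_idx[:, slot] == e            # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    layer = ToyMoELayer()
    print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])

Only two of the eight expert projections run for any given token, which is why the activated parameter counts (1.0B-4.5B) can be far smaller than the models' total parameter counts.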

Quick Start & Requirements

  • Install dependencies: pip install -e .
  • Requires Python >= 3.8.
  • Inference examples suggest 80GB GPU memory for deepseek-vl2-small and larger for deepseek-vl2.
  • Incremental prefilling can reduce memory requirements for deepseek-vl2-small to 40GB.
  • Official demos and inference scripts are available; a condensed single-image inference sketch follows this list.
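The following sketch shows the general shape of single-image inference with the deepseek_vl2 package. The class and helper names (DeepseekVLV2Processor, load_pil_images, prepare_inputs_embeds, model.language.generate) are recalled from the repository's README example and should be verified against the repo; the model id, image path, and prompt are placeholders.

    import torch
    from transformers import AutoModelForCausalLM
    from deepseek_vl2.models import DeepseekVLV2Processor
    from deepseek_vl2.utils.io import load_pil_images

    # Load the processor (tokenizer + image preprocessing) and the model weights.
    model_path = "deepseek-ai/deepseek-vl2-tiny"
    processor = DeepseekVLV2Processor.from_pretrained(model_path)
    tokenizer = processor.tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model = model.to(torch.bfloat16).cuda().eval()

    # A single-image turn: <image> marks where the image is injected and
    # <|ref|>...<|/ref|> asks the model to ground the referenced object.
    conversation = [
        {
            "role": "<|User|>",
            "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
            "images": ["./images/example.jpg"],   # placeholder path
        },
        {"role": "<|Assistant|>", "content": ""},
    ]

    # Load the images and batch the multimodal inputs.
    pil_images = load_pil_images(conversation)
    inputs = processor(
        conversations=conversation,
        images=pil_images,
        force_batchify=True,
        system_prompt="",
    ).to(model.device)

    # Encode images into the embedding stream, then decode with the language backbone.
    inputs_embeds = model.prepare_inputs_embeds(**inputs)
    outputs = model.language.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False))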

Highlighted Details

  • Three model variants by activated parameter count: Tiny (1.0B), Small (2.8B), and base (4.5B).
  • Supports multimodal inputs, including multiple images and object localization with bounding box output.
  • Achieves competitive or state-of-the-art performance with efficient MoE activation.
  • Offers incremental prefilling for reduced memory usage during inference (see the command sketch after this list).
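For incremental prefilling, the repository ships an inference.py script that prefills the prompt in chunks so deepseek-vl2-small fits in roughly 40GB of GPU memory. The invocation below is recalled from the README; treat the exact flag name and value as assumptions to verify against the repo:

    CUDA_VISIBLE_DEVICES=0 python inference.py --model_path "deepseek-ai/deepseek-vl2-small" --chunk_size 512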

Maintenance & Community

The project was released in December 2024. Contact is available via GitHub issues or service@deepseek.com.

Licensing & Compatibility

The code repository is licensed under the MIT License. Model usage is subject to the DeepSeek Model License, which permits commercial use.

Limitations & Caveats

The provided Gradio demo is a basic implementation and may run slowly; production environments should consider an optimized serving solution such as vLLM. The README notes that the larger models require significant GPU memory (80GB+).

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 42 stars in the last 30 days

Explore Similar Projects

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Forrest Iandola (Author of SqueezeNet; Research Scientist at Meta), and 17 more.

MiniGPT-4 by Vision-CAIR

Vision-language model for multi-task learning

Created 2 years ago, updated 1 year ago
26k stars
Top 0.0% on SourcePulse