Vision-language model for real-world applications (research paper)
Top 12.6% on sourcepulse
DeepSeek-VL is an open-source vision-language model designed for real-world multimodal understanding tasks. It supports diverse inputs like logical diagrams, web pages, scientific literature, and natural images, targeting researchers and developers building advanced AI applications. The models offer general multimodal capabilities, enabling complex reasoning and interaction with visual and textual data.
How It Works
DeepSeek-VL integrates a vision encoder with a large language model (LLM) to achieve multimodal understanding. It processes images and text through a unified architecture, allowing for tasks like image description, visual question answering, and multi-image reasoning. The models are available in 7B and 1.3B parameter sizes, with both base and chat variants, supporting a sequence length of 4096 tokens.
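As a rough illustration of this design, the sketch below shows how projected vision features can be interleaved with text token embeddings before being fed to the LLM. It is a toy example with hypothetical module names and dimensions, not the DeepSeek-VL implementation.

```python
# Illustrative sketch only (hypothetical names/dims): a vision encoder's patch
# features are projected into the LLM's hidden size and concatenated with text
# embeddings so the LLM attends over one multimodal sequence.
import torch
import torch.nn as nn

class ToyVisionLanguageBridge(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Linear projector maps vision features to the LLM hidden size.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_patches, vision_dim)
        # text_embeddings: (batch, num_text_tokens, llm_dim)
        image_tokens = self.projector(image_features)
        # The LLM then processes image and text tokens as a single sequence.
        return torch.cat([image_tokens, text_embeddings], dim=1)

bridge = ToyVisionLanguageBridge()
fused = bridge(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 608, 4096])
```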
Quick Start & Requirements
Install from source with pip install -e .. Inference requires a CUDA GPU with torch.bfloat16 support. For the local Gradio demo, install the extra dependencies with pip install -e .[gradio] and run python deepseek_vl/serve/app_deepseek.py. A programmatic inference sketch follows below.
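The snippet below is a hedged sketch of programmatic chat inference, adapted from the repository's quickstart pattern; the class and method names (VLChatProcessor, load_pil_images, prepare_inputs_embeds) and the example image path are assumptions to verify against the current repo.

```python
# Sketch of single-image chat inference with DeepSeek-VL (verify API against repo).
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl-7b-chat"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# Load the multimodal model in bfloat16 on GPU.
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

# Conversation with an image placeholder; image path is illustrative.
conversation = [
    {"role": "User",
     "content": "<image_placeholder>Describe this image.",
     "images": ["./images/example.jpg"]},
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Fuse image and text inputs into embeddings, then generate with the LLM.
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```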
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The provided README does not detail specific performance benchmarks or known limitations of the models. Inference requires a GPU and bfloat16 support.