DeepSeek-VL by deepseek-ai

Vision-language model for real-world applications (research paper)

Created 1 year ago
3,961 stars

Top 12.4% on SourcePulse

Project Summary

DeepSeek-VL is an open-source vision-language model designed for real-world multimodal understanding tasks. It supports diverse inputs like logical diagrams, web pages, scientific literature, and natural images, targeting researchers and developers building advanced AI applications. The models offer general multimodal capabilities, enabling complex reasoning and interaction with visual and textual data.

How It Works

DeepSeek-VL integrates a vision encoder with a large language model (LLM) to achieve multimodal understanding. It processes images and text through a unified architecture, allowing for tasks like image description, visual question answering, and multi-image reasoning. The models are available in 7B and 1.3B parameter sizes, with both base and chat variants, supporting a sequence length of 4096 tokens.
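
To make the 4096-token sequence length concrete, the sketch below estimates how many images could share one prompt with text. It is illustrative arithmetic only: the figure of 576 tokens per image is an assumption that this summary does not state, and max_images_in_context is a hypothetical helper, not part of the DeepSeek-VL code.

```python
# Illustrative only: rough context-budget arithmetic for a vision-language prompt.
SEQUENCE_LENGTH = 4096          # sequence length stated in the summary
ASSUMED_TOKENS_PER_IMAGE = 576  # ASSUMPTION: not given in this summary


def max_images_in_context(text_tokens, reserved_for_answer=512,
                          tokens_per_image=ASSUMED_TOKENS_PER_IMAGE,
                          sequence_length=SEQUENCE_LENGTH):
    """Return how many images fit after reserving room for text and the reply."""
    remaining = sequence_length - text_tokens - reserved_for_answer
    if remaining <= 0:
        return 0
    return remaining // tokens_per_image


if __name__ == "__main__":
    # A 200-token question with 512 tokens reserved for the answer.
    print(max_images_in_context(text_tokens=200))
```

Under these assumptions, the budget shrinks quickly as images are added, so multi-image prompts leave less room for text and for the generated answer.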

Quick Start & Requirements

  • Install from a local clone of the repository: pip install -e .
  • Requires Python >= 3.8.
  • Inference requires a CUDA-enabled GPU and torch.bfloat16 support.
  • Official Hugging Face demo available: https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B
  • Gradio demo can be run with: pip install -e .[gradio] and python deepseek_vl/serve/app_deepseek.py
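
Before installing, it can help to confirm that the machine meets the requirements listed above. The following is a minimal pre-flight sketch, not something shipped with the project: check_environment is a hypothetical helper, and it assumes PyTorch may or may not already be installed.

```python
import sys


def check_environment():
    """Report whether the stated requirements appear to be met."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so the GPU checks cannot run.
        report["torch_installed"] = False
        report["cuda_available"] = False
        report["bf16_supported"] = False
        return report
    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    # is_bf16_supported() queries the current CUDA device,
    # so it is only called when CUDA is present.
    report["bf16_supported"] = (
        report["cuda_available"] and torch.cuda.is_bf16_supported()
    )
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

If any entry prints False, inference as described here is unlikely to work on that machine without changes.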

Highlighted Details

  • Supports multiple images in a single conversation for in-context learning.
  • Offers both base and chat fine-tuned models.
  • Models available in 1.3B and 7B parameter sizes.
  • 4096 token sequence length.
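
Multi-image conversations can be pictured as a list of role-tagged messages in which each image is marked by a placeholder in the text. The sketch below builds such a structure in plain Python. The role names, the "<image_placeholder>" token, and the "images" key are assumptions modeled on the upstream repository's examples and should be checked against the repository; build_conversation is a hypothetical helper.

```python
# ASSUMPTION: placeholder token and message layout follow the upstream examples.
IMAGE_PLACEHOLDER = "<image_placeholder>"


def build_conversation(question, image_paths):
    """Build a single-turn, multi-image conversation as a list of role dicts."""
    # One placeholder per image, placed before the question text.
    placeholders = "".join(IMAGE_PLACEHOLDER for _ in image_paths)
    return [
        {
            "role": "User",
            "content": placeholders + question,
            "images": list(image_paths),
        },
        # An empty assistant turn marks where the model's reply would go.
        {"role": "Assistant", "content": ""},
    ]


if __name__ == "__main__":
    conversation = build_conversation(
        "What differs between these two charts?",
        ["chart_a.png", "chart_b.png"],  # hypothetical file names
    )
    print(conversation)
```

Keeping one placeholder per image makes it easy to verify that the text and the image list stay in sync before the conversation is handed to a model.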

Licensing & Compatibility

  • Code repository licensed under MIT.
  • Model usage subject to DeepSeek Model License.
  • Supports commercial use for both Base and Chat models.

Limitations & Caveats

The README does not detail performance benchmarks or known limitations of the models. Inference requires a CUDA-enabled GPU with bfloat16 (torch.bfloat16) support.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 16 stars in the last 30 days
