Janus by deepseek-ai

Research project on a unified multimodal model for understanding and generation

created 9 months ago
17,475 stars

Top 2.6% on sourcepulse

Project Summary

The Janus series offers unified multimodal understanding and generation capabilities, targeting researchers and developers working with vision-language models. It provides a flexible framework for tasks like image captioning and text-to-image generation, aiming to match or exceed specialized model performance with a simpler, unified architecture.

How It Works

Janus employs an autoregressive framework that decouples visual encoding into separate pathways for understanding and generation. This mitigates the conflict that arises when a single visual encoder must serve both roles, and it adds flexibility while still using one transformer architecture. JanusFlow combines this design with rectified flow, a state-of-the-art generative modeling technique, directly within the LLM framework and without complex architectural changes.
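To make the rectified-flow idea concrete, here is a minimal sketch in plain Python. It is not JanusFlow's implementation: the function names, the 1-D toy data, and the closed-form velocity field are all assumptions chosen for illustration. In general, rectified flow connects a noise sample x0 to a data sample x1 along the straight line x_t = (1 - t) * x0 + t * x1 and generates samples by integrating a velocity field from t = 0 to t = 1. A real model would predict that velocity with a network; here the exact field for a single target point stands in for it.

```python
def interpolate(x0: float, x1: float, t: float) -> float:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def toy_velocity(x: float, t: float, target: float) -> float:
    """Hypothetical stand-in for a learned velocity network.

    With a single target point, the velocity that keeps x on the straight
    path toward it is (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def euler_sample(x0: float, target: float, num_steps: int = 8) -> float:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so no division by zero
        x += dt * toy_velocity(x, t, target)
    return x


if __name__ == "__main__":
    noise, data = -1.5, 2.0
    print(interpolate(noise, data, 0.5))  # midpoint of the straight path
    print(euler_sample(noise, data))      # lands (approximately) on the data point
```

Because the path in this toy case is exactly straight, the number of Euler steps does not change where the sample ends up. That is the intuition behind the efficiency usually attributed to rectified flow, although a learned velocity field is only approximately straight.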

Quick Start & Requirements

  • Installation: run pip install -e . from the repository root; use pip install -e .[gradio] to include the Gradio demo dependencies.
  • Prerequisites: Python >= 3.8, PyTorch, Transformers, Diffusers (for JanusFlow). GPU with CUDA is required for inference.
  • Resources: Models range from 1.3B to 7B parameters, and VRAM requirements scale with model size. In bfloat16 the weights alone occupy about 2 bytes per parameter.
  • Demos: Online demos available on Hugging Face for Janus-Pro, Janus, and JanusFlow. Local Gradio and FastAPI demos are also provided.
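As a rough guide to the resource requirements above, the following sketch estimates the memory taken by the model weights alone. The helper name is hypothetical, and the parameter counts are the nominal 1.3B and 7B figures rather than exact counts. The estimate assumes 2 bytes per parameter (bfloat16) and ignores activations, the KV cache, and framework overhead, so actual VRAM use will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Nominal sizes from the summary; exact parameter counts may differ.
    for name, params in [("1.3B", 1.3e9), ("7B", 7e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights in bfloat16")
```

Passing bytes_per_param=4 gives the corresponding float32 figure, which is twice as large.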

Highlighted Details

  • Unified Framework: Handles both multimodal understanding and generation within a single model.
  • Janus-Pro: Advanced version with optimized training, expanded data, and larger model scaling for improved performance and stability.
  • JanusFlow: Integrates rectified flow for efficient and versatile vision-language generation.
  • Commercial Use: Permitted under the DeepSeek Model License.

Maintenance & Community

The project is actively developed by DeepSeek AI. Contact is available via email (service@deepseek.com) or by raising issues on the repository.

Licensing & Compatibility

  • Code License: MIT License.
  • Model License: DeepSeek Model License. Commercial usage is permitted.

Limitations & Caveats

The provided inference examples use torch.bfloat16 and .cuda(), so they assume a CUDA-capable NVIDIA GPU and a PyTorch build with bfloat16 support. The text-to-image generation examples involve multi-step procedures with specific parameters that may need tuning for optimal results.
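Before running the examples, it can help to check whether the local environment matches these assumptions. The sketch below is not part of the Janus repository, and the function names are hypothetical. It relies only on torch.cuda.is_available() and torch.cuda.is_bf16_supported(), and it degrades gracefully when PyTorch is not installed. Very old PyTorch releases may not provide is_bf16_supported().

```python
from typing import List, Optional, Tuple


def detect_cuda() -> Optional[Tuple[bool, bool]]:
    """Return (cuda_available, bf16_supported), or None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    available = torch.cuda.is_available()
    # Only query bfloat16 support when a CUDA device is actually present.
    bf16 = available and torch.cuda.is_bf16_supported()
    return available, bf16


def preflight_warnings(status: Optional[Tuple[bool, bool]]) -> List[str]:
    """Translate the detected status into human-readable warnings."""
    if status is None:
        return ["PyTorch is not installed."]
    cuda_available, bf16_supported = status
    warnings: List[str] = []
    if not cuda_available:
        warnings.append("No CUDA device found; .cuda() calls will fail.")
    elif not bf16_supported:
        warnings.append("GPU lacks native bfloat16 support; torch.bfloat16 may be slow or unsupported.")
    return warnings


if __name__ == "__main__":
    messages = preflight_warnings(detect_cuda())
    for message in messages or ["Environment matches the examples' assumptions."]:
        print(message)
```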

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 392 stars in the last 90 days
