Janus by deepseek-ai

Unified multimodal model research paper for understanding and generation

Created 11 months ago
17,547 stars

Top 2.7% on SourcePulse

Project Summary

The Janus series offers unified multimodal understanding and generation capabilities, targeting researchers and developers working with vision-language models. It provides a flexible framework for tasks like image captioning and text-to-image generation, aiming to match or exceed specialized model performance with a simpler, unified architecture.

How It Works

Janus employs an autoregressive framework that decouples visual encoding into separate pathways, mitigating conflicts between understanding and generation roles. This approach enhances flexibility while using a single transformer architecture. JanusFlow integrates this with rectified flow, a state-of-the-art generative modeling technique, directly within the LLM framework without complex architectural changes.

Quick Start & Requirements

  • Installation: pip install -e . (or pip install -e .[gradio] to include the Gradio demo).
  • Prerequisites: Python >= 3.8, PyTorch, Transformers, Diffusers (for JanusFlow). GPU with CUDA is required for inference.
  • Resources: Models range from 1.3B to 7B parameters. Inference requires significant VRAM.
  • Demos: Online demos available on Hugging Face for Janus-Pro, Janus, and JanusFlow. Local Gradio and FastAPI demos are also provided.
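As a rough sanity check on the VRAM note above, weight memory scales linearly with parameter count. A back-of-the-envelope sketch (assumes bfloat16 weights at 2 bytes per parameter and ignores activations and KV cache, so treat it as a lower bound; the helper name is illustrative, not part of the Janus codebase):

```python
def approx_weight_vram_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM (GiB) needed just to hold model weights.

    bfloat16 stores 2 bytes per parameter; activations and KV cache
    add overhead on top of this, so the result is a lower bound.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

# Model sizes mentioned above: roughly 1.3B and 7B parameters.
for size in (1.3, 7.0):
    print(f"{size}B params -> ~{approx_weight_vram_gb(size):.1f} GiB (weights only)")
# 1.3B params -> ~2.4 GiB (weights only)
# 7.0B params -> ~13.0 GiB (weights only)
```

In practice, actual usage during inference will be noticeably higher than these weight-only figures.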

Highlighted Details

  • Unified Framework: Handles both multimodal understanding and generation within a single model.
  • Janus-Pro: Advanced version with optimized training, expanded data, and larger model scaling for improved performance and stability.
  • JanusFlow: Integrates rectified flow for efficient and versatile vision-language generation.
  • Commercial Use: Permitted under the DeepSeek Model License.

Maintenance & Community

The project is actively developed by DeepSeek AI. Contact is available via email (service@deepseek.com) or by raising issues on the repository.

Licensing & Compatibility

  • Code License: MIT License.
  • Model License: DeepSeek Model License. Commercial usage is permitted.

Limitations & Caveats

The provided inference code examples use torch.bfloat16 and .cuda(), so they assume an NVIDIA GPU with CUDA and bfloat16 support. The text-to-image generation examples involve multi-step sampling with several tunable parameters that may require adjustment for optimal results.
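A portable variant of the hard-coded .cuda()/bfloat16 setup would fall back gracefully on machines without a CUDA GPU. A minimal sketch of the selection logic (pure Python so it runs without PyTorch installed; the function name is hypothetical, and in real code the flag would come from torch.cuda.is_available()):

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Select a device/dtype pair: bfloat16 on CUDA GPUs, float32 on CPU.

    Mirrors the fallback one would wrap around code that hard-codes
    `.cuda()` and `torch.bfloat16`, as the inference examples do.
    """
    if cuda_available:
        return "cuda", "bfloat16"
    return "cpu", "float32"

print(pick_device_and_dtype(False))  # ('cpu', 'float32')
```

With PyTorch available, the same idea is typically written as `device = "cuda" if torch.cuda.is_available() else "cpu"` before moving the model.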

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 72 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago