Janus by deepseek-ai

Unified multimodal model research paper for understanding and generation

Created 11 months ago
17,547 stars

Top 2.7% on SourcePulse

Project Summary

The Janus series offers unified multimodal understanding and generation capabilities, targeting researchers and developers working with vision-language models. It provides a flexible framework for tasks like image captioning and text-to-image generation, aiming to match or exceed specialized model performance with a simpler, unified architecture.

How It Works

Janus employs an autoregressive framework that decouples visual encoding into separate pathways, mitigating conflicts between understanding and generation roles. This approach enhances flexibility while using a single transformer architecture. JanusFlow integrates this with rectified flow, a state-of-the-art generative modeling technique, directly within the LLM framework without complex architectural changes.

Quick Start & Requirements

  • Installation: pip install -e . (or pip install -e .[gradio] to include the Gradio demo).
  • Prerequisites: Python >= 3.8, PyTorch, Transformers, Diffusers (for JanusFlow). GPU with CUDA is required for inference.
  • Resources: Models range from 1.3B to 7B parameters. Inference requires significant VRAM.
  • Demos: Online demos available on Hugging Face for Janus-Pro, Janus, and JanusFlow. Local Gradio and FastAPI demos are also provided.
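As a rough sanity check on the VRAM note above, weight memory scales linearly with parameter count. A back-of-the-envelope sketch (assumes bfloat16 weights at 2 bytes per parameter and ignores activations and KV cache, so treat it as a lower bound; the helper name is illustrative, not part of the Janus codebase):

```python
def approx_weight_vram_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM (GiB) needed just to hold model weights.

    bfloat16 stores 2 bytes per parameter; activations and KV cache
    add overhead on top of this, so the result is a lower bound.
    """
    return num_params_billion * 1e9 * bytes_per_param / 1024**3

# Model sizes mentioned above: roughly 1.3B and 7B parameters.
for size in (1.3, 7.0):
    print(f"{size}B params -> ~{approx_weight_vram_gb(size):.1f} GiB (weights only)")
# 1.3B params -> ~2.4 GiB (weights only)
# 7.0B params -> ~13.0 GiB (weights only)
```

In practice, actual usage during inference will be noticeably higher than these weight-only figures.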

Highlighted Details

  • Unified Framework: Handles both multimodal understanding and generation within a single model.
  • Janus-Pro: Advanced version with optimized training, expanded data, and larger model scaling for improved performance and stability.
  • JanusFlow: Integrates rectified flow for efficient and versatile vision-language generation.
  • Commercial Use: Permitted under the DeepSeek Model License.

Maintenance & Community

The project is actively developed by DeepSeek AI. Contact is available via email (service@deepseek.com) or by raising issues on the repository.

Licensing & Compatibility

  • Code License: MIT License.
  • Model License: DeepSeek Model License. Commercial usage is permitted.

Limitations & Caveats

The provided inference code examples use torch.bfloat16 and .cuda(), so they assume an NVIDIA GPU with CUDA and bfloat16 support. The text-to-image generation examples involve multi-step sampling with several tunable parameters that may require adjustment for optimal results.
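A portable variant of the hard-coded .cuda()/bfloat16 setup would fall back gracefully on machines without a CUDA GPU. A minimal sketch of the selection logic (pure Python so it runs without PyTorch installed; the function name is hypothetical, and in real code the flag would come from torch.cuda.is_available()):

```python
def pick_device_and_dtype(cuda_available: bool) -> tuple[str, str]:
    """Select a device/dtype pair: bfloat16 on CUDA GPUs, float32 on CPU.

    Mirrors the fallback one would wrap around code that hard-codes
    `.cuda()` and `torch.bfloat16`, as the inference examples do.
    """
    if cuda_available:
        return "cuda", "bfloat16"
    return "cpu", "float32"

print(pick_device_and_dtype(False))  # ('cpu', 'float32')
```

With PyTorch available, the same idea is typically written as `device = "cuda" if torch.cuda.is_available() else "cpu"` before moving the model.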

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 72 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago