SEED-X by AILab-CVC

Multimodal AI assistant for real-world applications

created 1 year ago
529 stars

Top 60.6% on sourcepulse

View on GitHub
Project Summary

SEED-X is a unified multimodal foundation model designed for real-world AI assistants, capable of multi-granularity comprehension and generation. It targets developers and researchers seeking versatile multimodal capabilities, offering instruction-tuned variants for tasks like image editing and story generation.

How It Works

SEED-X unifies multimodal comprehension and generation by integrating a de-tokenizer for image reconstruction from ViT features and supporting multi-turn conversations with images, text, and bounding boxes. Its architecture allows for instruction tuning on diverse datasets, enabling specialized functionalities like high-precision image editing and multimodal long story generation.
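The flow described above can be sketched as a toy pipeline. This is an illustrative sketch with made-up function names and shapes, not the actual SEED-X API: a ViT encoder turns an image into patch features, the LLM consumes those features alongside a text instruction and regresses new visual features, and a de-tokenizer reconstructs an image from the predicted features.

```python
import numpy as np

# Toy sketch of SEED-X's unified comprehension/generation loop.
# All names, shapes, and values here are illustrative placeholders.

def vit_encode(image):
    """Image -> (num_patches, dim) features, assuming 16x16 patches."""
    h, w, _ = image.shape
    patches = (h // 16) * (w // 16)
    rng = np.random.default_rng(0)
    return rng.standard_normal((patches, 768))

def llm_generate(visual_feats, prompt):
    """Stand-in for the LLM: returns a text reply plus regressed visual features."""
    reply = f"Edited per instruction: {prompt!r}"
    return reply, visual_feats + 0.01  # placeholder for predicted features

def detokenize(visual_feats, out_hw=(256, 256)):
    """Stand-in for the de-tokenizer: features -> reconstructed RGB image."""
    rng = np.random.default_rng(int(abs(visual_feats.sum())) % 2**32)
    return rng.random((*out_hw, 3))

# One round trip: comprehend the image, follow an edit instruction,
# then decode the predicted features back into pixels.
image = np.zeros((256, 256, 3))
feats = vit_encode(image)
text, new_feats = llm_generate(feats, "make the sky purple")
edited = detokenize(new_feats)
```

The key design point this illustrates is that generation and editing share one interface: the LLM emits visual features in the same space the de-tokenizer consumes, so image output is just another token stream.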

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Download checkpoints from Hugging Face (e.g., AILab-CVC/seed-x-17b-pretrain) and place them in ./pretrained. Requires Stable Diffusion XL and Qwen-VL-Chat weights.
  • Resources: Training requires multi-node setup with DeepSpeed. Inference demos are available online.
  • Links: Online Demo, SEED-Story, SEED-Data-Edit.

Highlighted Details

  • Supports multi-turn conversations with dynamic image resolutions.
  • Offers specialized models: SEED-X-I (general instruction-tuned), SEED-X-Edit (image editing), SEED-Story (multimodal story generation).
  • Released training code for instruction tuning with DeepSpeed Zero-2/3 support.
  • Includes a large-scale image editing dataset (3.7M samples).
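Since the released training code supports DeepSpeed ZeRO-2/3, a minimal ZeRO Stage-2 configuration sketch looks like the following. The batch-size and precision values are placeholders, not the repository's actual settings:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Stage 2 shards optimizer states and gradients across GPUs; switching `"stage"` to 3 additionally shards the parameters, trading communication overhead for lower per-GPU memory.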

Maintenance & Community

The project is actively developed by AILab-CVC. Latest updates include SEED-Story and SEED-Data-Edit releases.

Licensing & Compatibility

Licensed under Apache License Version 2.0. The LLaMA2 backbone's parameters are kept frozen during training; only the LoRA modules are optimized.

Limitations & Caveats

The general instruction-tuned model SEED-X-I does not support image manipulation. Inference code for SEED-X-Edit was slated for release soon as of the README's last update.

Health Check

Last commit: 5 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 36 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% · 4k stars
Open-source framework for training large multimodal models
created 2 years ago · updated 11 months ago