SEED-X  by AILab-CVC

Multimodal AI assistant for real-world applications

Created 1 year ago
544 stars

Top 58.7% on SourcePulse

Project Summary

SEED-X is a unified multimodal foundation model designed for real-world AI assistants, capable of multi-granularity comprehension and generation. It targets developers and researchers seeking versatile multimodal capabilities, offering instruction-tuned variants for tasks like image editing and story generation.

How It Works

SEED-X unifies multimodal comprehension and generation by integrating a de-tokenizer for image reconstruction from ViT features and supporting multi-turn conversations with images, text, and bounding boxes. Its architecture allows for instruction tuning on diverse datasets, enabling specialized functionalities like high-precision image editing and multimodal long story generation.
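The multi-turn conversation format described above can be sketched as a plain data structure. Note that the field names (`role`, `image`, `bbox`) are illustrative assumptions, not SEED-X's actual input schema:

```python
# Illustrative sketch of a multi-turn multimodal conversation payload.
# Field names ("role", "image", "bbox") are assumptions for illustration,
# not SEED-X's actual input schema.

def make_turn(role, text, image_path=None, bbox=None):
    """Build one conversation turn; bbox is (x1, y1, x2, y2) in pixels."""
    turn = {"role": role, "text": text}
    if image_path is not None:
        turn["image"] = image_path
    if bbox is not None:
        turn["bbox"] = list(bbox)
    return turn

# A dialogue mixing text, an image, and a bounding-box reference.
conversation = [
    make_turn("user", "What is in the highlighted region?",
              image_path="street.jpg", bbox=(40, 60, 220, 300)),
    make_turn("assistant", "A cyclist waiting at a crosswalk."),
    make_turn("user", "Generate a similar scene at night."),
]
```

The point of the structure is that grounding information (the bounding box) travels with the turn it belongs to, so later turns can refer back to it.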

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Download checkpoints from Hugging Face (e.g., AILab-CVC/seed-x-17b-pretrain) and place them in ./pretrained. Requires Stable Diffusion XL and Qwen-VL-Chat weights.
  • Resources: Training requires multi-node setup with DeepSpeed. Inference demos are available online.
  • Links: Online Demo, SEED-Story, SEED-Data-Edit.
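The setup steps above can be sketched as shell commands. The repository URL is inferred from the project name, and the checkpoint download assumes the `huggingface-cli` tool; exact repo IDs and paths may differ from the README:

```shell
# Clone and install (repo URL inferred from the project name).
git clone https://github.com/AILab-CVC/SEED-X.git
cd SEED-X
pip install -r requirements.txt

# Fetch checkpoints into ./pretrained; assumes huggingface-cli is
# available (pip install "huggingface_hub[cli]").
huggingface-cli download AILab-CVC/seed-x-17b-pretrain \
    --local-dir ./pretrained/seed-x-17b-pretrain
```

Stable Diffusion XL and Qwen-VL-Chat weights must be fetched separately, as noted above.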

Highlighted Details

  • Supports multi-turn conversations with dynamic image resolutions.
  • Offers specialized models: SEED-X-I (general instruction-tuned), SEED-X-Edit (image editing), SEED-Story (multimodal story generation).
  • Released training code for instruction tuning with DeepSpeed Zero-2/3 support.
  • Includes a large-scale image editing dataset (3.7M samples).
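For the DeepSpeed-based training mentioned above, a minimal ZeRO-2 configuration can be sketched as follows. The values here are placeholders, not the configs shipped in the SEED-X repo:

```python
import json

# Minimal DeepSpeed ZeRO-2 config sketch (batch sizes and flags are
# placeholders; SEED-X ships its own configs in the repository).
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                  # ZeRO-2: shard optimizer state and gradients
        "overlap_comm": True,        # overlap gradient communication with compute
        "contiguous_gradients": True,
    },
}

print(json.dumps(ds_config, indent=2))
```

Switching `"stage"` to 3 additionally shards the model parameters themselves, trading communication overhead for lower per-GPU memory.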

Maintenance & Community

The project is maintained by AILab-CVC. The most recent updates added the SEED-Story and SEED-Data-Edit releases.

Licensing & Compatibility

Licensed under Apache License Version 2.0. During training, the LLaMA2 backbone parameters are kept frozen and only the LoRA modules are optimized.
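The frozen-backbone-plus-LoRA setup can be sketched numerically. The shapes, rank, and initialization here are illustrative, not SEED-X's actual dimensions:

```python
import numpy as np

# LoRA sketch: the base weight W stays frozen; only the low-rank
# factors A and B are trained, and the effective weight is
# W + scale * (B @ A). Shapes and scale are illustrative.
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2

W = rng.standard_normal((d_out, d_in))         # frozen backbone weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, rank))                    # trainable, zero-initialized
scale = 1.0

def lora_forward(x):
    """y = (W + scale * B @ A) @ x; equals W @ x until B is updated."""
    return (W + scale * (B @ A)) @ x

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init B => identity to base
```

Because `B` starts at zero, the adapted model initially reproduces the frozen backbone exactly; training then moves only the small `A` and `B` matrices.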

Limitations & Caveats

The general instruction-tuned model SEED-X-I does not support image manipulation. Inference code for SEED-X-Edit was slated for release soon as of the README's last update.

Health Check

  • Last Commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).

  • NExT-GPT by NExT-GPT — any-to-any multimodal LLM research paper. Top 0.1% on SourcePulse, 4k stars. Created 2 years ago; updated 4 months ago.