SEED-X by AILab-CVC

Multimodal AI assistant for real-world applications

Created 1 year ago
552 stars

Top 57.9% on SourcePulse

View on GitHub
Project Summary

SEED-X is a unified multimodal foundation model designed for real-world AI assistants, capable of multi-granularity comprehension and generation. It targets developers and researchers seeking versatile multimodal capabilities, offering instruction-tuned variants for tasks like image editing and story generation.

How It Works

SEED-X unifies multimodal comprehension and generation by integrating a de-tokenizer for image reconstruction from ViT features and supporting multi-turn conversations with images, text, and bounding boxes. Its architecture allows for instruction tuning on diverse datasets, enabling specialized functionalities like high-precision image editing and multimodal long story generation.

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8 (Anaconda recommended), PyTorch >= 2.0.1, NVIDIA GPU with CUDA.
  • Model Weights: Download checkpoints from Hugging Face (e.g., AILab-CVC/seed-x-17b-pretrain) and place them in ./pretrained. Requires Stable Diffusion XL and Qwen-VL-Chat weights.
  • Resources: Training requires multi-node setup with DeepSpeed. Inference demos are available online.
  • Links: Online Demo, SEED-Story, SEED-Data-Edit.
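The setup steps above can be sketched as follows. The Hugging Face repo ID comes from this summary, but the clone URL and the exact directory layout under ./pretrained are assumptions — check the repository README for the authoritative paths.

```shell
# Clone the repository and install Python dependencies
# (clone URL assumed from the org/repo name)
git clone https://github.com/AILab-CVC/SEED-X.git
cd SEED-X
pip install -r requirements.txt

# Download pretrained checkpoints into ./pretrained
# (repo ID from the summary; subdirectory layout is an assumption)
mkdir -p pretrained
huggingface-cli download AILab-CVC/seed-x-17b-pretrain \
    --local-dir pretrained/seed-x-17b-pretrain

# SEED-X additionally requires Stable Diffusion XL and Qwen-VL-Chat
# weights; download and place them as described in the README.
```

Note that the 17B checkpoint is large, so ensure sufficient disk space before downloading.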

Highlighted Details

  • Supports multi-turn conversations with dynamic image resolutions.
  • Offers specialized models: SEED-X-I (general instruction-tuned), SEED-X-Edit (image editing), SEED-Story (multimodal story generation).
  • Released training code for instruction tuning with DeepSpeed Zero-2/3 support.
  • Includes a large-scale image editing dataset (3.7M samples).

Maintenance & Community

The project is actively developed by AILab-CVC. Latest updates include SEED-Story and SEED-Data-Edit releases.

Licensing & Compatibility

Licensed under Apache License Version 2.0. LLaMA2 parameters are frozen, with LoRA modules optimized during training.

Limitations & Caveats

The general instruction-tuned model SEED-X-I does not support image manipulation. Inference code for SEED-X-Edit was slated for release soon as of the README's last update.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).
