SEED by AILab-CVC

Multimodal LLM research paper with visual tokenization

created 2 years ago
618 stars

Top 54.2% on sourcepulse

Project Summary

SEED is an open-source project providing the official implementation of SEED-LLaMA, a multimodal large language model capable of both visual comprehension and generation. It is aimed at researchers and developers integrating vision and language capabilities into AI models, and offers emergent abilities such as multi-turn multimodal generation.

How It Works

SEED-LLaMA relies on the project's own SEED tokenizer to convert visual signals into discrete visual tokens. These tokens capture essential semantics while preserving a 1D causal dependency, so they can be interleaved with text and consumed by an LLM like ordinary text tokens. The model is built upon LLaMA2, comes in 8B and 14B variants, and supports efficient multi-node training via DeepSpeed.
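To make the tokenization idea concrete, here is a minimal sketch of vector-quantization-style visual tokenization: patch features snap to the nearest entry in a learned codebook, yielding discrete ids an LLM can consume like text. The class name, feature and codebook sizes, and input shapes are illustrative assumptions, not the project's actual API.

```python
# Minimal sketch of discrete visual tokenization via vector quantization.
# ToyVisualTokenizer and all sizes/shapes are hypothetical stand-ins, not
# the SEED repository's real interfaces.
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Maps continuous patch features to discrete ids via a learned codebook."""

    def __init__(self, feature_dim: int = 256, codebook_size: int = 8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feature_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, feature_dim); the patch order is
        # assumed causal, so the resulting ids form a 1D left-to-right sequence.
        codes = self.codebook.weight.unsqueeze(0).expand(
            patch_features.size(0), -1, -1
        )                                               # (batch, K, D)
        distances = torch.cdist(patch_features, codes)  # (batch, N, K)
        return distances.argmin(dim=-1)                 # discrete ids, (batch, N)

tokenizer = ToyVisualTokenizer()
features = torch.randn(1, 32, 256)   # stand-in for a vision encoder's output
visual_tokens = tokenizer(features)  # shape (1, 32): "visual words" for the LLM
```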

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8, PyTorch >= 1.11.0, and an NVIDIA GPU with CUDA (a quick environment check is sketched after this list).
  • Demo: The local demo runs on a single GPU: 16GB for the 8B model or 24GB for the 14B model. Backend and frontend scripts are provided to launch the Gradio demo.
  • Documentation: Technical details are available in linked arXiv papers.
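The prerequisites above can be verified with a short script. This is a convenience sketch based only on the version floors stated in this summary; it is not a utility shipped with the repository.

```python
# Hedged environment check against the stated prerequisites:
# Python >= 3.8, PyTorch >= 1.11.0, and a CUDA-capable GPU.
import sys

def check_environment() -> None:
    assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
    try:
        import torch
    except ImportError:
        raise SystemExit("PyTorch missing; run: pip install -r requirements.txt")
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    assert (major, minor) >= (1, 11), "PyTorch >= 1.11.0 is required"
    if torch.cuda.is_available():
        print(f"OK: torch {torch.__version__} on {torch.cuda.get_device_name(0)}")
    else:
        # Per the demo notes above: 16GB GPU for the 8B model, 24GB for the 14B.
        print("Warning: no CUDA GPU detected; the local demo needs one")

check_environment()
```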

Highlighted Details

  • Supports multimodal comprehension and generation, including interleaved image-text content (a toy layout is sketched after this list).
  • Features an instruction-tuned model that can generate informative text and images in a single response.
  • Offers an upgraded version, SEED-X, with continuous visual embeddings for multi-granularity comprehension.
  • Model weights for SEED tokenizer and SEED-LLaMA (8B/14B) are available.
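For intuition on the interleaved format mentioned in the first bullet, the sketch below lays text tokens and image-token spans into a single stream. The BOI/EOI marker names and the serialization are assumptions for illustration, not the repository's exact format.

```python
# Illustrative sketch of interleaved image-text content as one token stream.
# The <img>/</img> markers and <vN> token spelling are hypothetical.
BOI, EOI = "<img>", "</img>"  # assumed begin/end-of-image markers

def interleave(segments: list) -> list:
    """Flattens alternating text / visual-token-id segments into one sequence."""
    stream = []
    for seg in segments:
        if isinstance(seg, str):
            stream.extend(seg.split())             # toy text "tokenization"
        else:                                      # a list of visual token ids
            stream.append(BOI)
            stream.extend(f"<v{t}>" for t in seg)  # discrete visual tokens
            stream.append(EOI)
    return stream

print(interleave(["A photo of", [17, 402, 9], "and its caption."]))
# ['A', 'photo', 'of', '<img>', '<v17>', '<v402>', '<v9>', '</img>',
#  'and', 'its', 'caption.']
```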

Maintenance & Community

The project is actively developed by Tencent AI Lab and ARC Lab. Updates are regularly posted, including the release of SEED-X and training code for SEED-LLaMA. Inquiries can be directed to seed-x@googlegroups.com.

Licensing & Compatibility

SEED is released under the Apache License Version 2.0. SEED-LLaMA is released under the original license of LLaMA2.

Limitations & Caveats

The project is described as "still in progress." While the pretrained model can generate interleaved image-text content, the released SFT model does not retain this capability, since interleaved generation was handled separately during instruction tuning.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 10 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Top 0.1% on sourcepulse
4k stars
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago
Starred by Travis Fischer (founder of Agentic), Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Top 0.2% on sourcepulse
23k stars
Multimodal assistant with GPT-4 level capabilities
created 2 years ago
updated 11 months ago