SEED by AILab-CVC

Multimodal LLM research paper with visual tokenization

Created 2 years ago
623 stars

Top 53.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

SEED is an open-source project providing the official implementation of SEED-LLaMA, a multimodal large language model capable of both visual comprehension and generation. It is aimed at researchers and developers integrating vision and language capabilities into AI models, and it exhibits emergent abilities such as multi-turn multimodal generation.

How It Works

SEED-LLaMA uses the project's SEED tokenizer to convert visual signals into discrete visual tokens. This approach captures essential semantics while maintaining a 1D causal dependency, so visual tokens can be consumed and predicted by an LLM in the same way as text. The model is built on LLaMA2, released in 8B and 14B variants, and supports efficient multi-node training via DeepSpeed.
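The sketch below illustrates this idea in PyTorch; it is not the project's actual code, and the codebook size, feature dimension, and vocabulary offset are all illustrative assumptions. Continuous image features are vector-quantized against a learned codebook, and the resulting discrete indices are offset past the text vocabulary so visual and text tokens share one causal sequence:

```python
import torch

# Illustrative sizes only (assumptions); SEED's real dimensions may differ.
CODEBOOK_SIZE = 8192     # number of discrete visual tokens
FEATURE_DIM = 768        # per-feature embedding dimension
TEXT_VOCAB_SIZE = 32000  # LLaMA2 text vocabulary size

# Learned codebook: one embedding vector per discrete visual token.
codebook = torch.randn(CODEBOOK_SIZE, FEATURE_DIM)

def quantize(features: torch.Tensor) -> torch.Tensor:
    """Map continuous features (N, FEATURE_DIM) to discrete codebook
    indices (N,) by nearest-neighbor lookup."""
    distances = torch.cdist(features, codebook)  # (N, CODEBOOK_SIZE)
    return distances.argmin(dim=-1)

# Stand-in for a causal 1D stream of features from an image encoder.
image_features = torch.randn(32, FEATURE_DIM)

visual_ids = quantize(image_features)
# Offset visual ids past the text vocabulary so the LLM operates over one
# unified token space: [0, 32000) = text, [32000, 40192) = visual.
unified_ids = visual_ids + TEXT_VOCAB_SIZE
```

Because the resulting ids form a plain left-to-right sequence, the LLM can consume them for comprehension and predict them for generation with its usual next-token objective.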

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
  • Prerequisites: Python >= 3.8, PyTorch >= 1.11.0, NVIDIA GPU with CUDA.
  • Demo: Local demos require a single 16GB (8B) or 24GB (14B) GPU; a pre-flight memory check is sketched after this list. Backend and frontend scripts are provided to launch the Gradio demo.
  • Documentation: Technical details are available in linked arXiv papers.
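
Since the demo's main hardware constraint is GPU memory, it is easy to verify before launching. The following pre-flight check is a sketch (not a script shipped with the repository), assuming only the 16 GB / 24 GB figures above:

```python
import torch

# Minimum GPU memory stated in the demo instructions above.
REQUIRED_GB = {"8B": 16, "14B": 24}

def check_gpu(model_size: str = "8B") -> None:
    """Fail fast if the local GPU cannot host the chosen demo model."""
    if not torch.cuda.is_available():
        raise RuntimeError("An NVIDIA GPU with CUDA is required.")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    need = REQUIRED_GB[model_size]
    if total_gb < need:
        raise RuntimeError(
            f"SEED-LLaMA-{model_size} demo needs ~{need} GB; found {total_gb:.1f} GB."
        )
    print(f"OK: {total_gb:.1f} GB available for the {model_size} demo.")

check_gpu("8B")
```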

Highlighted Details

  • Supports multimodal comprehension and generation, including interleaved image-text content (sketched after this list).
  • Features an instruction-tuned model that can generate informative text and images in a single response.
  • Offers an upgraded version, SEED-X, with continuous visual embeddings for multi-granularity comprehension.
  • Model weights for SEED tokenizer and SEED-LLaMA (8B/14B) are available.
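
To make the interleaved-content bullet concrete, here is a minimal sketch of how a single response could mix text token ids with discrete visual token ids in one causal stream. The boundary markers and every id below are hypothetical placeholders; the released models define their own special tokens:

```python
# Hypothetical begin/end-of-image marker ids (placeholders, not SEED's).
BOI, EOI = 40192, 40193

def interleave(segments):
    """segments: ordered list of ("text", ids) or ("image", ids) chunks.
    Image chunks are wrapped in boundary markers so the model can switch
    between emitting text tokens and discrete visual tokens mid-response."""
    sequence = []
    for kind, ids in segments:
        if kind == "image":
            sequence.append(BOI)
            sequence.extend(ids)
            sequence.append(EOI)
        else:
            sequence.extend(ids)
    return sequence

# One reply that contains text, then a generated image, then more text.
reply = interleave([
    ("text", [101, 2057, 2064]),       # placeholder text token ids
    ("image", [32015, 32999, 33504]),  # placeholder visual token ids
    ("text", [2129, 2055]),
])
```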

Maintenance & Community

The project is developed by Tencent AI Lab and ARC Lab. Updates have been posted regularly, including the release of SEED-X and the training code for SEED-LLaMA. Inquiries can be directed to seed-x@googlegroups.com.

Licensing & Compatibility

SEED is released under the Apache License Version 2.0. SEED-LLaMA is released under the original license of LLaMA2.

Limitations & Caveats

The authors describe the project as "still in progress." While the instruction-tuned model can generate interleaved image-text content, the released SFT model does not include this capability, which was handled in a separate stage of instruction tuning.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

4 stars in the last 30 days

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

Explore Similar Projects

gill by kohjingyu

0%
463
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago