Liquid by FoundationVision

Multimodal generation research paper

Created 9 months ago
615 stars

Top 53.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Liquid is a scalable, unified autoregressive generation paradigm that handles multimodal comprehension and generation with a single large language model (LLM). It targets researchers and developers working on multimodal AI: the unified approach eliminates the need for external visual embeddings such as CLIP, and the authors report a scaling law under which the performance degradation from unified training diminishes as model size grows.

How It Works

Liquid employs a single LLM for both visual and language tasks over a unified token space, which allows visual comprehension and generation to mutually enhance each other. Across models ranging from 0.5B to 32B parameters, the authors report a scaling law: the performance drop associated with unified multimodal training shrinks as model size increases.
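To make the unified token space concrete, below is a minimal conceptual sketch, not Liquid's actual code: images are assumed to be quantized into discrete codes by a VQ-style tokenizer, and those codes are shifted past the text vocabulary so that a single next-token objective covers both modalities. All sizes and IDs here are hypothetical.

    # Conceptual sketch of a unified token space; not Liquid's actual code.
    # Hypothetical sizes: a text vocabulary and a VQ image codebook share
    # one LM vocabulary by shifting image codes past the text IDs.
    TEXT_VOCAB_SIZE = 32_000     # hypothetical text vocabulary size
    IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical VQ codebook size

    def image_code_to_token_id(code: int) -> int:
        """Shift a discrete VQ image code into the shared LM vocabulary."""
        assert 0 <= code < IMAGE_CODEBOOK_SIZE
        return TEXT_VOCAB_SIZE + code

    def token_id_to_image_code(token_id: int) -> int:
        """Inverse mapping, used when decoding generated image tokens."""
        assert TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
        return token_id - TEXT_VOCAB_SIZE

    # One training sequence mixes both modalities under a single
    # next-token prediction objective: caption tokens, then image tokens.
    caption_ids = [101, 2054, 2003]  # hypothetical text token IDs
    image_ids = [image_code_to_token_id(c) for c in (7, 4090, 12)]
    sequence = caption_ids + image_ids

Because text and image tokens live in one vocabulary, the same transformer weights and the same sampling loop serve both comprehension and generation, which is consistent with the mutual enhancement noted above.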

Quick Start & Requirements

  • Install: pip install gradio==4.44.1 gradio_client==1.3.0
  • Prerequisites: HuggingFace transformers library (a hedged loading sketch follows this list). For GPUs with less than 30 GB of VRAM, pass --load_8bit.
  • Demo: Run python app.py in the evaluation directory.
  • Inference:
    • Text-to-text: python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "..."
    • Image understanding: python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt "..."
    • Image generation: python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "..." [--load_8bit]
  • Training: See Data.md and TRAIN.md.
  • Resources: Official demo and checkpoints are available.
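For reference, here is a minimal loading-and-generation sketch using the HuggingFace transformers API, assuming the checkpoint loads via AutoModelForCausalLM and accepts a plain text prompt (the repo's inference_t2t.py is the authoritative path); the 8-bit quantization config stands in for the --load_8bit flag:

    # Hedged sketch: load Junfeng5/Liquid_V1_7B with HuggingFace transformers.
    # Assumptions: the checkpoint works with AutoModelForCausalLM and a plain
    # text prompt; the repo's inference_t2t.py is the authoritative reference.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_PATH = "Junfeng5/Liquid_V1_7B"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
        # 8-bit quantization, analogous to the --load_8bit flag,
        # for GPUs with less than 30 GB of VRAM.
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )

    inputs = tokenizer("Describe a rainy street at night.", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

For image understanding and image generation, use the repo's inference_i2t.py and inference_t2i.py scripts, which handle the modality-specific token handling.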

Highlighted Details

  • Unified autoregressive generation for multimodal tasks.
  • Eliminates reliance on external visual embeddings (e.g., CLIP).
  • Demonstrates a scaling law for multimodal generation across 0.5B-32B parameter models.
  • Supports high-quality text-to-image generation at arbitrary aspect ratios.
  • Enables mutual enhancement between visual comprehension and generation.

Maintenance & Community

The project is associated with authors from HUST, ByteDance, and HKU. Checkpoints and evaluation scripts for Liquid-7B-IT are released.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Checkpoints for the other pre-trained models in the 0.5B-32B range (beyond Liquid-7B-IT) have not yet been released. Training code is available but is documented separately in Data.md and TRAIN.md.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago