Liquid by FoundationVision

Multimodal generation research paper

created 7 months ago
608 stars

Top 54.7% on sourcepulse

Project Summary

Liquid is a scalable, unified autoregressive generation paradigm that handles both multimodal comprehension and generation with a single large language model (LLM). Aimed at researchers and developers working on multimodal AI, it eliminates the need for external visual embeddings such as CLIP and exhibits a scaling law in which the performance degradation from unified training diminishes as model size grows.

How It Works

Liquid represents images and text in a single unified token space, so one LLM handles both visual and language tasks with the same autoregressive objective. This shared space allows visual comprehension and generation to mutually enhance each other. The project also reports a scaling law: the performance drop associated with unified multimodal training shrinks as model size increases, observed across models ranging from 0.5B to 32B parameters.
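
As a concrete illustration of a unified token space, here is a minimal PyTorch sketch. All names and sizes are hypothetical, not Liquid's actual architecture or tokenizer: the idea shown is simply that discrete image codes are appended to the text vocabulary, so a single causal LM trains on mixed text-plus-image sequences with one cross-entropy loss.

```python
import torch
import torch.nn as nn

# Toy illustration of a unified token space: text tokens and discrete image
# codes (e.g., from a VQ-style tokenizer) share one vocabulary, so a single
# decoder-only LM predicts both with the same next-token objective.
# All sizes and names here are hypothetical, not Liquid's actual config.
TEXT_VOCAB = 32000           # ordinary text token ids: [0, TEXT_VOCAB)
IMAGE_VOCAB = 8192           # image code ids offset into [TEXT_VOCAB, VOCAB)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

class TinyUnifiedLM(nn.Module):
    def __init__(self, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)  # one head covers text AND image codes

    def forward(self, ids):
        x = self.embed(ids)
        # causal mask: each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(x, mask=mask))

# A mixed sequence: a text prompt followed by image codes. One loss trains
# both modalities, which is what lets them share capacity in one model.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))
seq = torch.cat([text_ids, image_ids], dim=1)

logits = TinyUnifiedLM()(seq[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)
)
print(loss.item())
```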

Quick Start & Requirements

  • Install: pip install gradio==4.44.1 gradio_client==1.3.0
  • Prerequisites: Hugging Face transformers library. For GPUs with less than 30 GB of VRAM, pass --load_8bit.
  • Demo: Run python app.py in the evaluation directory.
  • Inference (a hedged programmatic sketch follows this list):
    • Text-to-text: python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "..."
    • Image understanding: python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt '...'
    • Image generation: python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "..." [--load_8bit]
  • Training: See Data.md and TRAIN.md.
  • Resources: Official demo and checkpoints are available.
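
For programmatic use, the following is a hedged sketch of the text-to-text path only. It assumes the checkpoint loads through the standard transformers causal-LM interface; the repo's inference_*.py scripts are the supported entry point and additionally handle Liquid's image tokenization. Quantizing via load_in_8bit mirrors the repo's --load_8bit flag.

```python
# Hedged sketch, not the repo's documented API: assumes Junfeng5/Liquid_V1_7B
# exposes the standard Hugging Face causal-LM interface. device_map="auto"
# assumes the accelerate package is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junfeng5/Liquid_V1_7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_8bit=False,  # set True on <30GB GPUs, mirroring --load_8bit
)

prompt = "Write a haiku about autumn."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```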

Highlighted Details

  • Unified autoregressive generation for multimodal tasks.
  • Eliminates reliance on external visual embeddings (e.g., CLIP).
  • Demonstrates a scaling law for multimodal generation across 0.5B-32B parameter models.
  • Supports high-quality text-to-image generation at arbitrary aspect ratios.
  • Enables mutual enhancement between visual comprehension and generation.

Maintenance & Community

The project is associated with authors from HUST, ByteDance, and HKU. Checkpoints and evaluation scripts for Liquid-7B-IT are released.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Checkpoints for the other pre-trained models in the 0.5B-32B range (beyond Liquid-7B-IT) have not yet been released. Training code is available but requires consulting separate documentation (Data.md and TRAIN.md).

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 56 stars in the last 90 days
