Multimodal generation research paper
Liquid is a scalable and unified autoregressive generation paradigm that integrates multimodal comprehension and generation using a single large language model (LLM). It targets researchers and developers working with multimodal AI, offering a unified approach that eliminates the need for external visual embeddings like CLIP and demonstrates a scaling law where performance degradation from unified training diminishes with model size.
How It Works
Liquid employs a single LLM for both visual and language tasks, enabling a unified token space. This architecture allows visual comprehension and generation to mutually enhance each other. The project highlights a discovered scaling law indicating that the performance drop associated with unified multimodal training is mitigated as model size increases, with models ranging from 0.5B to 32B parameters.
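The core idea can be illustrated with a minimal, purely conceptual sketch (not code from the repository): an image is discretized by a VQ tokenizer into codebook indices, and those indices are offset into the same id space as the text vocabulary, so a single language-model head predicts tokens of either modality. All names and sizes below are illustrative assumptions.

# Conceptual sketch of a unified token space; all names and sizes are illustrative.
TEXT_VOCAB_SIZE = 32_000        # e.g. a BPE text vocabulary
IMAGE_CODEBOOK_SIZE = 8_192     # e.g. a VQ image codebook
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE

def to_unified_ids(text_ids, image_codes):
    """Map text token ids and discrete image codes into one shared id space.

    Image codes are offset by the text vocabulary size, so both modalities
    can be predicted by the same softmax over UNIFIED_VOCAB_SIZE ids.
    """
    image_ids = [TEXT_VOCAB_SIZE + code for code in image_codes]
    # A text-to-image training sequence: prompt tokens followed by image tokens.
    return text_ids + image_ids

# Example: a 3-token text prompt followed by 4 discrete image codes.
print(to_unified_ids([101, 7, 42], [5, 900, 17, 4095]))

Because every id lives in a single vocabulary, the same next-token objective covers both comprehension (image tokens in the context, text tokens predicted) and generation (text prompt in the context, image tokens predicted).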
Quick Start & Requirements
The Gradio demo requires pip install gradio==4.44.1 gradio_client==1.3.0 in addition to the transformers library. For low-VRAM GPUs (<30 GB), pass --load_8bit. Launch the demo with python app.py.

The inference scripts are located in the evaluation directory:

python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "..."
python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt '...'
python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "..." [--load_8bit]
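As an alternative to the provided scripts, the checkpoint can in principle be loaded directly with the transformers API. The sketch below is a hedged example for plain text-to-text generation; it assumes the Hub checkpoint Junfeng5/Liquid_V1_7B exposes a standard causal-LM interface, and the repository's inference_*.py scripts remain the supported entry points.

# Hedged sketch: direct loading via transformers for text-to-text generation.
# Assumes a standard causal-LM interface; trust_remote_code may be needed if
# the checkpoint ships custom modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junfeng5/Liquid_V1_7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # or load_in_8bit=True on low-VRAM GPUs
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Write a short poem about rivers.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))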
Data preparation and training procedures are documented in Data.md and TRAIN.md.

Highlighted Details
Maintenance & Community
The project is associated with authors from HUST, ByteDance, and HKU. Checkpoints and evaluation scripts for Liquid-7B-IT are released.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and integration with closed-source projects.
Limitations & Caveats
Checkpoints for pre-trained models other than Liquid-7B-IT (the 0.5B-32B series) are not yet released. Training code is available but requires referring to the separate documentation (Data.md and TRAIN.md).