Liquid by FoundationVision

Multimodal generation research paper

Created 9 months ago
615 stars

Top 53.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Liquid is a scalable, unified autoregressive generation paradigm that handles multimodal comprehension and generation with a single large language model (LLM). It targets researchers and developers working on multimodal AI: the unified approach eliminates the need for external visual embeddings such as CLIP, and the authors report a scaling law under which the performance degradation from unified training diminishes as model size grows.

How It Works

Liquid employs a single LLM for both visual and language tasks over a unified token space, which allows visual comprehension and generation to mutually enhance each other. Across models ranging from 0.5B to 32B parameters, the authors report a scaling law: the performance drop associated with unified multimodal training shrinks as model size increases.
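To make the unified token space concrete, below is a minimal conceptual sketch, not Liquid's actual code: images are assumed to be quantized into discrete codes by a VQ-style tokenizer, and those codes are shifted past the text vocabulary so that a single next-token objective covers both modalities. All sizes and IDs here are hypothetical.

    # Conceptual sketch of a unified token space; not Liquid's actual code.
    # Hypothetical sizes: a text vocabulary and a VQ image codebook share
    # one LM vocabulary by shifting image codes past the text IDs.
    TEXT_VOCAB_SIZE = 32_000     # hypothetical text vocabulary size
    IMAGE_CODEBOOK_SIZE = 8_192  # hypothetical VQ codebook size

    def image_code_to_token_id(code: int) -> int:
        """Shift a discrete VQ image code into the shared LM vocabulary."""
        assert 0 <= code < IMAGE_CODEBOOK_SIZE
        return TEXT_VOCAB_SIZE + code

    def token_id_to_image_code(token_id: int) -> int:
        """Inverse mapping, used when decoding generated image tokens."""
        assert TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
        return token_id - TEXT_VOCAB_SIZE

    # One training sequence mixes both modalities under a single
    # next-token prediction objective: caption tokens, then image tokens.
    caption_ids = [101, 2054, 2003]  # hypothetical text token IDs
    image_ids = [image_code_to_token_id(c) for c in (7, 4090, 12)]
    sequence = caption_ids + image_ids

Because text and image tokens live in one vocabulary, the same transformer weights and the same sampling loop serve both comprehension and generation, which is consistent with the mutual enhancement noted above.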

Quick Start & Requirements

  • Install: pip install gradio==4.44.1 gradio_client==1.3.0
  • Prerequisites: HuggingFace transformers library (a hedged loading sketch follows this list). For GPUs with less than 30 GB of VRAM, pass --load_8bit.
  • Demo: Run python app.py in the evaluation directory.
  • Inference:
    • Text-to-text: python inference_t2t.py --model_path Junfeng5/Liquid_V1_7B --prompt "..."
    • Image understanding: python inference_i2t.py --model_path Junfeng5/Liquid_V1_7B --image_path samples/baklava.png --prompt "..."
    • Image generation: python inference_t2i.py --model_path Junfeng5/Liquid_V1_7B --prompt "..." [--load_8bit]
  • Training: See Data.md and TRAIN.md.
  • Resources: Official demo and checkpoints are available.
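For reference, here is a minimal loading-and-generation sketch using the HuggingFace transformers API, assuming the checkpoint loads via AutoModelForCausalLM and accepts a plain text prompt (the repo's inference_t2t.py is the authoritative path); the 8-bit quantization config stands in for the --load_8bit flag:

    # Hedged sketch: load Junfeng5/Liquid_V1_7B with HuggingFace transformers.
    # Assumptions: the checkpoint works with AutoModelForCausalLM and a plain
    # text prompt; the repo's inference_t2t.py is the authoritative reference.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_PATH = "Junfeng5/Liquid_V1_7B"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
        # 8-bit quantization, analogous to the --load_8bit flag,
        # for GPUs with less than 30 GB of VRAM.
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )

    inputs = tokenizer("Describe a rainy street at night.", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

For image understanding and image generation, use the repo's inference_i2t.py and inference_t2i.py scripts, which handle the modality-specific token handling.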

Highlighted Details

  • Unified autoregressive generation for multimodal tasks.
  • Eliminates reliance on external visual embeddings (e.g., CLIP).
  • Demonstrates a scaling law for multimodal generation across 0.5B-32B parameter models.
  • Supports high-quality text-to-image generation at arbitrary aspect ratios.
  • Enables mutual enhancement between visual comprehension and generation.

Maintenance & Community

The project is associated with authors from HUST, ByteDance, and HKU. Checkpoints and evaluation scripts for Liquid-7B-IT are released.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

Checkpoints for the other pre-trained models in the 0.5B-32B range (beyond Liquid-7B-IT) have not yet been released. Training code is available but is documented separately in Data.md and TRAIN.md.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

NExT-GPT by NExT-GPT

0.1%
4k stars
Any-to-any multimodal LLM research paper
Created 2 years ago
Updated 4 months ago