Cookbooks for multimodal RAG with VLMs
This repository provides recipes for learning about, fine-tuning, and adapting ColPali, a retrieval model built on a Vision-Language Model (VLM) for efficient multimodal RAG. It targets researchers and practitioners who want to leverage visual document content (layout, charts) alongside text for retrieval, bypassing traditional OCR and layout-analysis pipelines. The primary benefit is a single unified model for document understanding and retrieval.
How It Works
ColPali builds multi-vector page embeddings: each document page image is fed through PaliGemma-3B, and the resulting output patch embeddings are projected into a shared low-dimensional embedding space. Following the ColBERT late-interaction methodology, a query is scored by matching each query token embedding against all page patch embeddings and summing the per-token maxima (MaxSim); training contrastively maximizes this score for matching query-page pairs. This approach integrates textual and visual document features into a single retrieval system, simplifying the RAG pipeline.
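To make the late-interaction scoring concrete, here is a minimal sketch of the ColBERT-style MaxSim operator described above; the function name, tensor shapes, and normalization step are illustrative assumptions, not code from the repository:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score between one query and one page.

    query_emb: (num_query_tokens, dim) query token embeddings
    page_emb:  (num_patches, dim) page patch embeddings
    """
    # Cosine-style similarity via dot products of L2-normalized vectors.
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    p = torch.nn.functional.normalize(page_emb, dim=-1)
    sim = q @ p.T  # (num_query_tokens, num_patches)
    # For each query token, keep its best-matching patch, then sum over tokens.
    return sim.max(dim=1).values.sum()
```

Ranking a query against a corpus then amounts to computing this score for every page and sorting; no single-vector pooling of the page is required.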
Quick Start & Requirements
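The quick-start instructions are not reproduced in this summary. As a hedged sketch, inference with the companion colpali-engine package typically looks like the following; the checkpoint name, file paths, and query text are placeholder assumptions based on the public ColPali release:

```python
# pip install colpali-engine  (assumed companion package)
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # assumed public checkpoint
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

# One rendered document page and one text query (paths are placeholders).
images = [Image.open("page_1.png")]
queries = ["What does the revenue chart show?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    page_embeddings = model(**batch_images)    # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# Late-interaction (MaxSim) scores: one row per query, one column per page.
scores = processor.score_multi_vector(query_embeddings, page_embeddings)
print(scores)
```

A CUDA-capable GPU is assumed here; running the 3B-parameter backbone on CPU is possible but slow.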
Highlighted Details
Maintenance & Community
The project is associated with authors of the ColPali paper, indicating active research. Further community engagement channels are not explicitly listed in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The repository focuses on ColPali and closely related models; broader multimodal RAG solutions are not covered. As cookbooks, the recipes are illustrative examples and may require adaptation for production systems.