singularguy / Multimodal RAG system for text and image data processing
Top 92.6% on SourcePulse
This project implements a multimodal Retrieval Augmented Generation (RAG) system designed to process and query both text and image data. It targets engineers and researchers seeking to build applications that can understand and respond to queries involving mixed media. The system offers a unified approach to indexing, retrieving, and generating responses across text and images, simplifying multimodal data interaction.
How It Works
The system leverages CLIP (openai/clip-vit-base-patch32) to generate unified vector embeddings for text descriptions and associated images. These embeddings are indexed using Faiss (IndexIDMap2 + IndexFlatIP) for efficient similarity search. Document metadata is persisted using SQLite, and the Faiss index is saved to disk. Retrieved information is then passed to Zhipu AI (glm-4-flash) for contextually relevant response generation. This approach allows for flexible querying using pure text, pure images, or a combination of both, enabling a cohesive multimodal RAG experience.
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Note: faiss-cpu is included; for GPU support, install faiss-gpu after configuring CUDA. Set your Zhipu AI key in a .env file as ZHIPUAI_API_KEY. Prepare data.json (containing document metadata) and place the corresponding images in the images/ directory. Run the main script with python MultimodalRAG.py or launch the Jupyter notebook MultimodalRAG.ipynb. A requirements.txt and an example data.json are provided in the README.
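Assuming the repository is already cloned and laid out as described, the setup condenses to a shell session like the following (the API key value is a placeholder):

```shell
pip install -r requirements.txt               # faiss-cpu by default; use faiss-gpu if CUDA is configured
echo "ZHIPUAI_API_KEY=your-key-here" > .env   # placeholder key; substitute your own
# place data.json in the repo root and the referenced images in images/, then:
python MultimodalRAG.py
```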
Maintenance & Community
The project shows recent activity with update logs from April-May 2024. Contribution is welcomed via GitHub Issues and Pull Requests. Links to the author's sharing platforms (Xiaohongshu, Feishu Docs) are provided for technical discussions.
Licensing & Compatibility
The project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
For large-scale datasets, performance may necessitate using faiss-gpu or more advanced Faiss index types. The current multimodal fusion strategy relies on simple vector averaging, and more sophisticated methods could be explored. Enhancements in error handling, logging, and potentially upgrading to direct multimodal LLM processing (e.g., GLM-4V) are suggested for future improvements. SQLite may require replacement with a dedicated vector database for production scalability.
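The vector-averaging fusion mentioned above can be sketched in a few lines. This is a hedged illustration of the general technique, not the project's exact implementation: average the unit-normalized text and image embeddings, then re-normalize so the result stays comparable under inner-product search.

```python
import numpy as np

def fuse(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Average two unit-normalized embeddings and re-normalize the result."""
    v = (text_vec + image_vec) / 2.0
    return v / np.linalg.norm(v)

# Toy 2-D example: orthogonal unit vectors standing in for the two modalities.
t = np.array([1.0, 0.0], dtype="float32")
i = np.array([0.0, 1.0], dtype="float32")
q = fuse(t, i)
print(q)  # the fused query lies "between" the two modalities on the unit circle
```

The simplicity of this strategy is exactly the caveat noted above: averaging treats both modalities as equally informative, whereas learned fusion or a native multimodal LLM (e.g. GLM-4V) could weigh them by relevance.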