MultimodalRAG by singularguy

Multimodal RAG system for text and image data processing

Created 8 months ago
282 stars

Top 92.6% on SourcePulse

Project Summary

This project implements a multimodal Retrieval Augmented Generation (RAG) system that processes and queries both text and image data. It targets engineers and researchers building applications that answer queries involving mixed media. Indexing, retrieval, and response generation run through a single pipeline for both text and images.

How It Works

The system uses CLIP (openai/clip-vit-base-patch32) to embed text descriptions and their associated images into a shared vector space. The embeddings are indexed with Faiss (IndexIDMap2 wrapping IndexFlatIP) for exact inner-product similarity search. Document metadata is persisted in SQLite, and the Faiss index is saved to disk. Retrieved documents are then passed to Zhipu AI (glm-4-flash), which generates a response grounded in that context. Queries can be pure text, pure image, or a combination of both.
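The retrieval step can be illustrated without Faiss. IndexFlatIP performs exact (brute-force) inner-product search, and IndexIDMap2 lets callers attach their own integer IDs to vectors. The sketch below is a pure-Python stand-in for that combination, not the project's code: the class name `FlatIPIndex` is hypothetical, the toy 3-dimensional vectors stand in for CLIP embeddings, and L2-normalizing vectors before indexing (so that inner product equals cosine similarity) is an assumption about how the embeddings are prepared.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class FlatIPIndex:
    """Hypothetical stand-in for Faiss IndexIDMap2 + IndexFlatIP (exact search)."""

    def __init__(self, dim):
        self.dim = dim
        self._vectors = {}  # doc_id -> unit-length vector

    def add_with_ids(self, vectors, ids):
        """Store each vector under the caller-supplied integer ID."""
        for vec, doc_id in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self._vectors[doc_id] = l2_normalize(vec)

    def search(self, query, k):
        """Return the k (doc_id, score) pairs with the highest inner product."""
        q = l2_normalize(query)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy usage: three "documents" and one query vector.
index = FlatIPIndex(dim=3)
index.add_with_ids(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    [10, 20, 30],
)
hits = index.search([1.0, 0.1, 0.0], k=2)
```

The returned IDs are what a metadata store would then be queried with, which is why the index keeps caller-supplied IDs rather than insertion positions.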

Quick Start & Requirements

  • Installation: Clone the repository, create a Python 3.9+ virtual environment (conda recommended), and install dependencies via pip install -r requirements.txt. Note: faiss-cpu is included; for GPU support, install faiss-gpu after configuring CUDA.
  • Prerequisites: Python 3.9 or higher, Zhipu AI API key (configured in a .env file as ZHIPUAI_API_KEY).
  • Running: Prepare data.json (containing document metadata) and place corresponding images in the images/ directory. Execute the main script using python MultimodalRAG.py or launch the Jupyter notebook MultimodalRAG.ipynb.
  • Links: requirements.txt, data.json example provided in README.
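Before running the main script, it can help to check that the pieces listed above are in place. The following is a minimal sketch under stated assumptions, not part of the project: the helper names are hypothetical, `data.json` is assumed to be a JSON array of objects, and the `"image"` field name is a guess (the actual schema is in the README example). The hand-rolled `.env` parser is only a stdlib stand-in for whatever loader the project uses.

```python
import json
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_records(data_path):
    """Load document records; assumes the file holds a JSON array of objects."""
    records = json.loads(Path(data_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


def missing_images(records, image_dir):
    """List referenced image files that are absent from image_dir."""
    missing = []
    for rec in records:
        name = rec.get("image")  # hypothetical field name
        if name and not (Path(image_dir) / name).is_file():
            missing.append(name)
    return missing
```

A typical check would call `read_env_file(".env")` and confirm that `ZHIPUAI_API_KEY` is present, then call `missing_images(load_records("data.json"), "images")` and confirm that the result is empty.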

Highlighted Details

  • Supports indexing and retrieval for text, images, and combined multimodal queries.
  • Utilizes CLIP for generating unified text and image vector representations.
  • Employs Faiss for high-performance similarity search.
  • Features a modular design with distinct classes for encoding, indexing, retrieval, and generation.
  • Persists index and metadata to disk for state management.
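The metadata persistence mentioned in the last point can be sketched with Python's built-in `sqlite3` module. The table name, column names, and helper functions below are hypothetical and chosen only for illustration; the project's actual schema may differ. The idea shown is that the vector index returns ranked integer IDs, and the SQLite store maps those IDs back to the text and image path.

```python
import sqlite3


def open_store(db_path):
    """Open (or create) the metadata database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               doc_id INTEGER PRIMARY KEY,
               text TEXT NOT NULL,
               image_path TEXT
           )"""
    )
    return conn


def upsert_document(conn, doc_id, text, image_path=None):
    """Insert a document, replacing any existing row with the same ID."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO documents (doc_id, text, image_path) "
            "VALUES (?, ?, ?)",
            (doc_id, text, image_path),
        )


def fetch_documents(conn, doc_ids):
    """Return rows for the given IDs in the same (ranked) order as doc_ids."""
    if not doc_ids:
        return []
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        "SELECT doc_id, text, image_path FROM documents "
        f"WHERE doc_id IN ({placeholders})",
        list(doc_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in doc_ids if i in by_id]
```

Passing a file path to `open_store` keeps the metadata across runs, while `":memory:"` gives a throwaway database for experiments. `fetch_documents` re-sorts the rows because SQL `IN` does not preserve the order of the ranked IDs.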

Maintenance & Community

The project's update log records changes from April to May 2024. Contributions are welcome via GitHub Issues and Pull Requests. Links to the author's sharing platforms (Xiaohongshu, Feishu Docs) are provided for technical discussions.

Licensing & Compatibility

The project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

Large datasets may require faiss-gpu or more advanced Faiss index types to keep search fast. The current multimodal fusion strategy is simple vector averaging; more sophisticated methods could be explored. Suggested future improvements include better error handling and logging, and possibly passing text and images directly to a multimodal LLM (e.g., GLM-4V). For production scale, SQLite may need to be replaced, for example with a dedicated vector database.
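The vector-averaging fusion can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function name `fuse_embeddings` and the `text_weight` parameter are hypothetical, and normalizing the inputs and the result is an assumed choice that keeps the fused vector comparable under inner-product search. With `text_weight=0.5` the function reduces to a plain average; other values show one simple direction a more refined fusion could take.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_embeddings(text_vec, image_vec, text_weight=0.5):
    """Blend a text and an image embedding into one unit-length query vector."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image embeddings must share a dimension")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    t = l2_normalize(text_vec)
    i = l2_normalize(image_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, i)]
    # Raises ValueError if the two inputs cancel out exactly.
    return l2_normalize(mixed)


# Toy 2-dimensional vectors standing in for CLIP embeddings.
fused = fuse_embeddings([2.0, 0.0], [0.0, 5.0])
```

Averaging treats both modalities as equally informative for every query, which is the limitation noted above: a query whose image is only loosely related to its text pulls the fused vector away from either intent.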

Health Check

  • Last commit: 8 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 22 stars in the last 30 days
