singularguy / Multimodal RAG system for text and image data processing
Top 92.6% on SourcePulse
This project implements a multimodal Retrieval Augmented Generation (RAG) system designed to process and query both text and image data. It targets engineers and researchers seeking to build applications that can understand and respond to queries involving mixed media. The system offers a unified approach to indexing, retrieving, and generating responses across text and images, simplifying multimodal data interaction.
How It Works
The system leverages CLIP (openai/clip-vit-base-patch32) to generate unified vector embeddings for text descriptions and associated images. These embeddings are indexed using Faiss (IndexIDMap2 + IndexFlatIP) for efficient similarity search. Document metadata is persisted using SQLite, and the Faiss index is saved to disk. Retrieved information is then passed to Zhipu AI (glm-4-flash) for contextually relevant response generation. This approach allows for flexible querying using pure text, pure images, or a combination of both, enabling a cohesive multimodal RAG experience.
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt. Note: faiss-cpu is included; for GPU support, install faiss-gpu after configuring CUDA. Set your Zhipu AI key in a .env file as ZHIPUAI_API_KEY. Prepare data.json (containing document metadata) and place the corresponding images in the images/ directory. Run the main script with python MultimodalRAG.py or launch the Jupyter notebook MultimodalRAG.ipynb. A requirements.txt and an example data.json are provided in the README.
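Assuming the repository is already cloned and laid out as described, the setup condenses to a shell session like the following (the API key value is a placeholder):

```shell
pip install -r requirements.txt               # faiss-cpu by default; use faiss-gpu if CUDA is configured
echo "ZHIPUAI_API_KEY=your-key-here" > .env   # placeholder key; substitute your own
# place data.json in the repo root and the referenced images in images/, then:
python MultimodalRAG.py
```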
Maintenance & Community
The project shows recent activity with update logs from April-May 2024. Contribution is welcomed via GitHub Issues and Pull Requests. Links to the author's sharing platforms (Xiaohongshu, Feishu Docs) are provided for technical discussions.
Licensing & Compatibility
The project is released under the MIT License, which is permissive and generally suitable for commercial use and integration into closed-source projects.
Limitations & Caveats
For large-scale datasets, performance may necessitate using faiss-gpu or more advanced Faiss index types. The current multimodal fusion strategy relies on simple vector averaging, and more sophisticated methods could be explored. Enhancements in error handling, logging, and potentially upgrading to direct multimodal LLM processing (e.g., GLM-4V) are suggested for future improvements. SQLite may require replacement with a dedicated vector database for production scalability.
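The vector-averaging fusion mentioned above can be sketched in a few lines. This is a hedged illustration of the general technique, not the project's exact implementation: average the unit-normalized text and image embeddings, then re-normalize so the result stays comparable under inner-product search.

```python
import numpy as np

def fuse(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Average two unit-normalized embeddings and re-normalize the result."""
    v = (text_vec + image_vec) / 2.0
    return v / np.linalg.norm(v)

# Toy 2-D example: orthogonal unit vectors standing in for the two modalities.
t = np.array([1.0, 0.0], dtype="float32")
i = np.array([0.0, 1.0], dtype="float32")
q = fuse(t, i)
print(q)  # the fused query lies "between" the two modalities on the unit circle
```

The simplicity of this strategy is exactly the caveat noted above: averaging treats both modalities as equally informative, whereas learned fusion or a native multimodal LLM (e.g. GLM-4V) could weigh them by relevance.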