RL framework for LMMs to perform multimodal search
MMSearch-R1 provides an end-to-end Reinforcement Learning (RL) framework designed to enable Large Multimodal Models (LMMs) to conduct on-demand, multi-turn searches using real-world multimodal search tools. This framework is targeted at researchers and developers looking to enhance LMMs' ability to interact with external information sources for more comprehensive and context-aware responses.
How It Works
The framework integrates LMMs with specialized search tools, including an image search tool (SerpAPI-based) and a text search tool (SerpAPI, JINA Reader, and Qwen3-32B summarization). It employs an RL approach to incentivize LMMs to learn optimal search strategies across multiple turns, dynamically deciding when and how to query these tools to gather relevant information. This approach allows for on-demand information retrieval, improving the LMM's ability to handle complex queries that require external data.
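The runtime control flow can be pictured roughly as follows. This is a minimal sketch under assumed names: call_lmm, image_search, text_search, and the message format are illustrative callables and conventions, not the actual MMSearch-R1 API.

    # Minimal sketch of the on-demand, multi-turn search loop described above.
    # The callables and message format are illustrative assumptions, not the
    # actual MMSearch-R1 interfaces.
    from typing import Callable, Dict, List

    def multimodal_search_rollout(
        question: str,
        image: bytes,
        call_lmm: Callable[[List[Dict]], Dict],  # returns {"action": ..., "argument": ...}
        image_search: Callable[[bytes], str],    # e.g. a SerpAPI-backed image search
        text_search: Callable[[str], str],       # e.g. SerpAPI + JINA Reader + summarizer
        max_turns: int = 4,
    ) -> str:
        """Let the model decide, turn by turn, whether to search or to answer."""
        messages: List[Dict] = [{"role": "user", "text": question, "image": image}]
        for _ in range(max_turns):
            step = call_lmm(messages)            # model chooses its next action
            if step["action"] == "answer":
                return step["argument"]          # answered without further searching
            if step["action"] == "image_search":
                observation = image_search(step["argument"])
            else:                                # "text_search"
                observation = text_search(step["argument"])
            messages.append({"role": "assistant", "action": step["action"], "argument": step["argument"]})
            messages.append({"role": "tool", "text": observation})
        # Turn budget exhausted: ask for a final answer.
        messages.append({"role": "user", "text": "Answer now."})
        return call_lmm(messages)["argument"]

In the framework itself, the decision of when and what to search is learned through RL rather than hard-coded; the loop above only illustrates the multi-turn control flow at inference time.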
Quick Start & Requirements
Clone the repository and its submodules (git clone --recurse-submodules), create and activate a conda environment (conda create -n mmsearch_r1 python=3.10 -y, conda activate mmsearch_r1), and install dependencies (pip3 install -e ./verl, pip3 install vllm==0.8.2, transformers==4.51.0, flash-attn==2.7.4.post1, wandb).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework requires users to implement their own search tool pipelines, which adds a significant setup step. While a model and dataset are released, an inference script example is still listed as a ToDo item, suggesting that direct inference might not be immediately straightforward.
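To make the "bring your own search pipeline" requirement concrete, the sketch below shows what a user-supplied text search tool could look like. The SerpAPI and JINA Reader endpoints used here are their public interfaces; the summarize() hook stands in for the Qwen3-32B summarization step, and the overall structure is an assumption rather than code from the repository.

    # Illustrative user-supplied text search pipeline: web search via SerpAPI,
    # page extraction via the JINA Reader endpoint, then summarization.
    # This is a sketch, not code from the MMSearch-R1 repository.
    import os
    import requests

    def text_search(query: str, summarize, top_k: int = 3) -> str:
        resp = requests.get(
            "https://serpapi.com/search",
            params={"q": query, "engine": "google", "api_key": os.environ["SERPAPI_KEY"]},
            timeout=30,
        )
        results = resp.json().get("organic_results", [])[:top_k]
        pages = []
        for result in results:
            # Prefixing a URL with https://r.jina.ai/ returns a readable text rendering.
            page = requests.get("https://r.jina.ai/" + result["link"], timeout=30)
            pages.append(page.text)
        # summarize() stands in for the Qwen3-32B summarization step.
        return summarize("\n\n".join(pages))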