RL framework for LMMs to perform multimodal search
MMSearch-R1 provides an end-to-end Reinforcement Learning (RL) framework designed to enable Large Multimodal Models (LMMs) to conduct on-demand, multi-turn searches using real-world multimodal search tools. This framework is targeted at researchers and developers looking to enhance LMMs' ability to interact with external information sources for more comprehensive and context-aware responses.
How It Works
The framework integrates LMMs with specialized search tools, including an image search tool (SerpAPI-based) and a text search tool (SerpAPI, JINA Reader, and Qwen3-32B summarization). It employs an RL approach to incentivize LMMs to learn optimal search strategies across multiple turns, dynamically deciding when and how to query these tools to gather relevant information. This approach allows for on-demand information retrieval, improving the LMM's ability to handle complex queries that require external data.
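The runtime control flow can be pictured roughly as follows. This is a minimal sketch under assumed names: call_lmm, image_search, text_search, and the message format are illustrative callables and conventions, not the actual MMSearch-R1 API.

    # Minimal sketch of the on-demand, multi-turn search loop described above.
    # The callables and message format are illustrative assumptions, not the
    # actual MMSearch-R1 interfaces.
    from typing import Callable, Dict, List

    def multimodal_search_rollout(
        question: str,
        image: bytes,
        call_lmm: Callable[[List[Dict]], Dict],  # returns {"action": ..., "argument": ...}
        image_search: Callable[[bytes], str],    # e.g. a SerpAPI-backed image search
        text_search: Callable[[str], str],       # e.g. SerpAPI + JINA Reader + summarizer
        max_turns: int = 4,
    ) -> str:
        """Let the model decide, turn by turn, whether to search or to answer."""
        messages: List[Dict] = [{"role": "user", "text": question, "image": image}]
        for _ in range(max_turns):
            step = call_lmm(messages)            # model chooses its next action
            if step["action"] == "answer":
                return step["argument"]          # answered without further searching
            if step["action"] == "image_search":
                observation = image_search(step["argument"])
            else:                                # "text_search"
                observation = text_search(step["argument"])
            messages.append({"role": "assistant", "action": step["action"], "argument": step["argument"]})
            messages.append({"role": "tool", "text": observation})
        # Turn budget exhausted: ask for a final answer.
        messages.append({"role": "user", "text": "Answer now."})
        return call_lmm(messages)["argument"]

In the framework itself, the decision of when and what to search is learned through RL rather than hard-coded; the loop above only illustrates the multi-turn control flow at inference time.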
Quick Start & Requirements
Clone the repository and its submodules (git clone --recurse-submodules), create and activate a conda environment (conda create -n mmsearch_r1 python=3.10 -y, conda activate mmsearch_r1), and install dependencies (pip3 install -e ./verl, pip3 install vllm==0.8.2, transformers==4.51.0, flash-attn==2.7.4.post1, wandb).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework requires users to implement their own search tool pipelines, which adds a significant setup step. While a model and dataset are released, an inference script example is still listed as a ToDo item, suggesting that direct inference might not be immediately straightforward.
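To make the "bring your own search pipeline" requirement concrete, the sketch below shows what a user-supplied text search tool could look like. The SerpAPI and JINA Reader endpoints used here are their public interfaces; the summarize() hook stands in for the Qwen3-32B summarization step, and the overall structure is an assumption rather than code from the repository.

    # Illustrative user-supplied text search pipeline: web search via SerpAPI,
    # page extraction via the JINA Reader endpoint, then summarization.
    # This is a sketch, not code from the MMSearch-R1 repository.
    import os
    import requests

    def text_search(query: str, summarize, top_k: int = 3) -> str:
        resp = requests.get(
            "https://serpapi.com/search",
            params={"q": query, "engine": "google", "api_key": os.environ["SERPAPI_KEY"]},
            timeout=30,
        )
        results = resp.json().get("organic_results", [])[:top_k]
        pages = []
        for result in results:
            # Prefixing a URL with https://r.jina.ai/ returns a readable text rendering.
            page = requests.get("https://r.jina.ai/" + result["link"], timeout=30)
            pages.append(page.text)
        # summarize() stands in for the Qwen3-32B summarization step.
        return summarize("\n\n".join(pages))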