multimodal-search-r1 by EvolvingLMMs-Lab

RL framework for LMMs to perform multimodal search

Created 6 months ago
312 stars

Top 86.3% on SourcePulse

View on GitHub
Project Summary

MMSearch-R1 provides an end-to-end Reinforcement Learning (RL) framework designed to enable Large Multimodal Models (LMMs) to conduct on-demand, multi-turn searches using real-world multimodal search tools. This framework is targeted at researchers and developers looking to enhance LMMs' ability to interact with external information sources for more comprehensive and context-aware responses.

How It Works

The framework integrates LMMs with specialized search tools, including an image search tool (SerpAPI-based) and a text search tool (SerpAPI, JINA Reader, and Qwen3-32B summarization). It employs an RL approach to incentivize LMMs to learn optimal search strategies across multiple turns, dynamically deciding when and how to query these tools to gather relevant information. This approach allows for on-demand information retrieval, improving the LMM's ability to handle complex queries that require external data.
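The multi-turn decision loop described above can be sketched as follows. This is a minimal illustration, not the repository's actual interface: the action schema, tool names (image_search, text_search), and turn budget are all assumptions for the sake of the example.

```python
# Hypothetical sketch of an on-demand, multi-turn search loop.
# The model inspects its context each turn and either emits a tool
# call (routed to an external search tool) or a final answer.

def run_multiturn_search(model, question, tools, max_turns=3):
    """Query the model in turns; execute a tool call when requested,
    otherwise return the model's final answer."""
    context = [question]
    for _ in range(max_turns):
        action = model(context)           # model decides: search or answer
        if action["type"] == "answer":
            return action["content"]
        tool = tools[action["tool"]]      # e.g. "image_search" or "text_search"
        result = tool(action["query"])    # call the external search tool
        context.append(result)            # feed retrieved info back to the model
    # Turn budget exhausted: force a final answer from what was gathered.
    return model(context + ["answer now"])["content"]

# Minimal stub demo: the "model" searches once, then answers.
def stub_model(context):
    if len(context) == 1:
        return {"type": "search", "tool": "text_search", "query": context[0]}
    return {"type": "answer", "content": f"answer using {context[-1]}"}

tools = {"text_search": lambda q: f"results for '{q}'"}
print(run_multiturn_search(stub_model, "Who won X?", tools))
# → answer using results for 'Who won X?'
```

In the actual framework, the RL reward shapes when the policy chooses to search at all, so a well-trained model answers directly when no external information is needed.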

Quick Start & Requirements

  • Installation: Clone the repository with submodules (git clone --recurse-submodules), create and activate a conda environment (conda create -n mmsearch_r1 python=3.10 -y, then conda activate mmsearch_r1), and install dependencies: pip3 install -e ./verl, followed by pip3 install vllm==0.8.2 transformers==4.51.0 flash-attn==2.7.4.post1 wandb.
  • Prerequisites: Python 3.10, vLLM, Transformers, Flash Attention, and Weights & Biases (wandb) for logging. Users are expected to build their own search tool pipelines.
  • Resources: Requires a wandb API key for logging. Hardware requirements are not explicitly stated, but a CUDA-capable GPU is implied by the use of vLLM and Flash Attention.
  • Links: Paper, Blog, Model/Data
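The installation steps above can be run as one sequence. Versions are taken from the README; the repository URL is inferred from the project name shown on this page, so verify it before use.

```shell
# Clone with submodules (veRL is vendored as a submodule)
git clone --recurse-submodules https://github.com/EvolvingLMMs-Lab/multimodal-search-r1.git
cd multimodal-search-r1

# Create and activate the conda environment
conda create -n mmsearch_r1 python=3.10 -y
conda activate mmsearch_r1

# Install the vendored veRL package, then the pinned dependencies
pip3 install -e ./verl
pip3 install vllm==0.8.2 transformers==4.51.0 flash-attn==2.7.4.post1 wandb
```

Note that flash-attn builds against your local CUDA toolchain, so installation can take a while and requires a compatible GPU setup.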

Highlighted Details

  • End-to-end RL framework for LMMs.
  • Supports multi-turn search with real-world tools.
  • Integrates image and text search capabilities.
  • Released MMSearch-R1-7B model and FactualVQA dataset.

Maintenance & Community

  • Active development, with recent releases of the model, dataset, and paper, plus ongoing code updates.
  • Acknowledgements mention contributions from Qwen2.5-VL, veRL, OpenDeepResearcher, and others.
  • No explicit links to community channels (Discord/Slack) or roadmaps are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state the license for the code or models. The project acknowledges other repositories, some of which may have specific licenses. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework requires users to implement their own search tool pipelines, which adds a significant setup step. While a model and dataset are released, an inference script example is still listed as a ToDo item, suggesting that direct inference might not be immediately straightforward.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 8
  • Star History: 22 stars in the last 30 days

Explore Similar Projects

clip-retrieval by rom1504

  • CLIP retrieval system for semantic search
  • 0.2% · 3k stars · Created 4 years ago · Updated 1 month ago
  • Starred by John Resig (author of jQuery; Chief Software Architect at Khan Academy), Chenlin Meng (cofounder of Pika), and 9 more.

Perplexica by ItzCrazyKns

  • AI-powered search engine alternative
  • 5.7% · 25k stars · Created 1 year ago · Updated 1 day ago
  • Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Taranjeet Singh (cofounder of Mem0), and 8 more.