Vision-DeepResearch by Osilly

Multimodal LLM for deep research and extensive search

Created 3 weeks ago


367 stars

Top 77.2% on SourcePulse

View on GitHub
Project Summary

This project introduces Vision-DeepResearch, a multimodal large language model (MLLM) designed for extended, long-horizon deep-research tasks. It targets researchers and advanced users needing to perform complex, multi-turn reasoning and extensive web-based information retrieval. The primary benefit is enabling MLLMs to engage in significantly deeper and more interactive research processes than previously possible.

How It Works

Vision-DeepResearch extends traditional MLLM capabilities by enabling dozens of reasoning turns and hundreds of search engine interactions. Its core innovation lies in facilitating a "deep-research" workflow, allowing the model to iteratively refine queries, analyze search results, and synthesize information over extended periods. This approach is advantageous for tackling complex problems that require sustained investigation and information gathering, moving beyond single-turn question-answering.
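The iterative refine-search-synthesize loop described above can be sketched as a minimal agent loop. All function names, the stub implementations, and the stopping heuristic below are hypothetical stand-ins for illustration, not the project's actual interfaces:

```python
# Minimal sketch of a multi-turn deep-research loop (hypothetical interfaces;
# Vision-DeepResearch's real agent API is not documented in this summary).

def search(query):
    # Stand-in for a real search-engine call (e.g., via a SERP API).
    return [f"result for: {query}"]

def llm_step(question, evidence):
    # Stand-in for one MLLM reasoning turn: refine the query until enough
    # evidence is gathered, then synthesize a final answer.
    if len(evidence) < 3:
        return ("search", f"{question} (refinement {len(evidence)})")
    return ("answer", f"answer synthesized from {len(evidence)} results")

def deep_research(question, max_turns=10):
    # Interleave reasoning turns with search calls until the model answers
    # or the turn budget runs out.
    evidence = []
    for _ in range(max_turns):
        action, payload = llm_step(question, evidence)
        if action == "answer":
            return payload
        evidence.extend(search(payload))
    return "budget exhausted"

print(deep_research("example question"))
```

A real run would replace the stubs with search-engine and model endpoints, and the described workflow scales this loop to dozens of turns and hundreds of searches.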

Quick Start & Requirements

Setup involves cloning the repository and installing the core components verl, Megatron-LM, mbridge, and rllm via pip.

  • Data preparation: convert the provided Parquet datasets to JSONL format using the included scripts.
  • Training: distinct SFT and RL phases; RL training additionally requires deploying vLLM-served Extract and Judge models and configuring API keys (SERP_API_KEY, JINA_API_KEY) and OSS settings.
  • Evaluation: serve the model behind an OpenAI-compatible API and configure the evaluation scripts with the expected data formats and model endpoints.
  • Hardware: significant GPU resources (e.g., 8x GPUs for vLLM serving) and specific dependencies such as vLLM are required.
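The Parquet-to-JSONL conversion step can be sketched generically. The repository ships its own conversion scripts, so the function names and record layout below are assumptions, not the project's code:

```python
# Generic sketch of converting a Parquet dataset to JSONL (one JSON object
# per line). The repo's own scripts may use a different record schema.
import json

def records_to_jsonl(records):
    # Serialize a list of dict records as JSONL text, one object per line.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def parquet_to_jsonl(parquet_path, jsonl_path):
    # pandas (with a Parquet engine such as pyarrow) is assumed available;
    # imported lazily so the pure helper above works without it.
    import pandas as pd
    df = pd.read_parquet(parquet_path)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        f.write(records_to_jsonl(df.to_dict(orient="records")) + "\n")
```

JSONL is a common input format for SFT/RL training pipelines because each line can be streamed independently without loading the whole dataset.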

Highlighted Details

  • Vision-DeepResearch-8B and -30B models demonstrate substantial performance gains across benchmarks like VDR, FVQA, MMSearch, LiveVQA, and BC-VL, particularly within agentic workflows, outperforming leading models like GPT-5 and Gemini 2.5 Pro in specific metrics.
  • The project provides both SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning) codebases, enabling advanced training methodologies for multimodal research agents.
  • Associated datasets, including a Cold-start Dataset (demo), RL Dataset (demo), and the VDR-Bench benchmark, are available on Hugging Face.

Maintenance & Community

The project has seen recent releases of SFT and RL code, along with datasets and an 8B model, as of early February 2026. No explicit community channels (e.g., Discord, Slack) or detailed roadmap are provided in the README.

Licensing & Compatibility

The README does not specify a software license. This lack of explicit licensing information presents a significant adoption blocker, as it leaves the terms for use, modification, and distribution unclear, particularly for commercial applications.

Limitations & Caveats

The Vision-DeepResearch-30B-A3B model weights are listed as "coming soon." While demo datasets are available, the full datasets might require further preparation or are not yet fully released. The setup process is complex, requiring the installation of multiple specialized libraries and the deployment of external model services, indicating a steep learning curve and potentially high resource requirements for users.

Health Check

Last Commit: 2 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 4
Star History: 370 stars in the last 27 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Simon Willison (co-creator of Django), and 10 more.

LAVIS by salesforce

0.0%
11k
Library for language-vision AI research
Created 3 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems").

RAG-Anything by HKUDS

1.4%
14k
All-in-one multimodal RAG system
Created 8 months ago
Updated 1 day ago