DeepEyesV2 by Visual-Agent

Agentic multimodal model for integrated reasoning and tool use

Created 1 month ago
400 stars

Top 72.2% on SourcePulse

View on GitHub
Project Summary

DeepEyesV2 is an agentic multimodal model that improves complex reasoning by combining visual understanding, code execution, and web search in a single unified loop. It targets researchers and engineers building AI agents for reliable, multi-step problem-solving, and reports notable gains in tool usage and task adaptation.

How It Works

The model unifies sandboxed code execution (Jupyter-style) and web search (APIs, image cache) within a single reasoning loop. It leverages curated SFT and RL training data, built upon Qwen-2.5-VL foundation models. Reinforcement learning enables sophisticated tool combinations and adaptive, context-aware invocation, yielding strong reasoning and tool-use capabilities.
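The unified reasoning loop described above can be sketched as follows. This is a hypothetical illustration: the tool names, message format, and model interface are assumptions for clarity, not the project's actual API.

```python
# Hypothetical sketch of a unified reasoning loop that interleaves
# model "thoughts" with tool calls (code execution, web search).
# All names here are illustrative, not DeepEyesV2's real interface.

def run_code(snippet: str) -> str:
    """Stand-in for the sandboxed Jupyter-style code executor."""
    return f"executed: {snippet}"

def web_search(query: str) -> str:
    """Stand-in for the online search tool (text APIs / image cache)."""
    return f"results for: {query}"

TOOLS = {"code": run_code, "search": web_search}

def agent_loop(model, task: str, max_steps: int = 8) -> str:
    """Alternate model decisions with tool results until a final answer."""
    context = [("task", task)]
    for _ in range(max_steps):
        action, payload = model(context)  # model picks a tool or answers
        if action == "answer":
            return payload
        # Append the tool's observation so the next step can condition on it.
        context.append((action, TOOLS[action](payload)))
    return "no answer within step budget"
```

In the actual system, reinforcement learning trains the model to decide when each tool is worth invoking, rather than following a fixed call order.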

Quick Start & Requirements

  • Installation: The cold-start (SFT) stage uses LLaMA-Factory; follow its documentation for environment setup. For reinforcement training: cd reinforcement_learning, pip install -e ., then run scripts/install_deepeyes.sh.
  • Prerequisites:
      • Models: Qwen-2.5-VL-7B/32B-Instruct as the foundation; Qwen-2.5-72B-Instruct as the judge.
      • Hardware: high-performance machines for code servers; training requires >= 32 GPUs (7B) or >= 64 GPUs (32B) and >= 1200 GB CPU RAM per node.
      • Software: Docker for the code sandbox, a Ray cluster for training, wandb for visualization.
      • Data: SFT data (https://huggingface.co/datasets/honglyhly/DeepEyesV2_SFT), RL data (https://huggingface.co/datasets/honglyhly/DeepEyesV2_RL), and MMSearch-R1 cache data.
  • Links: Evaluation details are provided in the repository.

Highlighted Details

  • Code Sandbox: Executes Jupyter-style code safely within Docker containers.
  • Online Search: Integrates text search via online services and image search using the MMSearch-R1 cache, bypassing RAG.
  • LLM-as-a-Judge: Employs Qwen models for verification tasks.
  • Reinforcement Learning: Facilitates complex tool chaining and adaptive tool invocation.
  • Curated Corpus: Training data is rigorously filtered and cleaned for SFT and RL.
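As a toy illustration of the Jupyter-style execution the code sandbox provides, the snippet below runs a code cell and captures its printed output. The real project isolates execution inside Docker containers; this in-process version has no such isolation and exists only to show the cell-execution idea.

```python
# Minimal illustration of Jupyter-style cell execution with captured
# stdout. NOT the project's sandbox: DeepEyesV2 runs cells inside
# Docker containers for isolation, which this toy version lacks.
import contextlib
import io

def execute_cell(source: str) -> str:
    """Run a code cell in a fresh namespace and return its printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # fresh globals per cell, like a one-cell kernel
    return buffer.getvalue()
```

For example, execute_cell("print(2 ** 10)") returns the string "1024\n". A production sandbox would add resource limits, network policy, and container-level isolation on top of this.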

Maintenance & Community

No explicit information on community channels (Discord, Slack), active contributors beyond authors, or project roadmap is provided in the README.

Licensing & Compatibility

Released under the Apache 2.0 license, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Training DeepEyesV2 is resource-intensive, demanding substantial GPU and CPU RAM. Deploying multiple code servers and high-performance machines is recommended for RL training to mitigate bandwidth saturation and network timeouts. Setup relies on external projects like LLaMA-Factory and VeRL.

Health Check
Last Commit

6 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
6
Star History
401 stars in the last 30 days
