DeepEyesV2 by Visual-Agent

Agentic multimodal model for integrated reasoning and tool use

Created 1 month ago
400 stars

Top 72.2% on SourcePulse

View on GitHub
Project Summary

DeepEyesV2 is an agentic multimodal model that improves complex reasoning by combining visual understanding, code execution, and web search in a single unified loop. It targets researchers and engineers building AI agents for reliable, multi-step problem-solving, and reports notable gains in tool usage and task adaptation.

How It Works

The model unifies sandboxed code execution (Jupyter-style) and web search (APIs, image cache) within a single reasoning loop. It leverages curated SFT and RL training data, built upon Qwen-2.5-VL foundation models. Reinforcement learning enables sophisticated tool combinations and adaptive, context-aware invocation, yielding strong reasoning and tool-use capabilities.
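The unified reasoning loop described above can be sketched as follows. This is a hypothetical illustration: the tool names, message format, and model interface are assumptions for clarity, not the project's actual API.

```python
# Hypothetical sketch of a unified reasoning loop that interleaves
# model "thoughts" with tool calls (code execution, web search).
# All names here are illustrative, not DeepEyesV2's real interface.

def run_code(snippet: str) -> str:
    """Stand-in for the sandboxed Jupyter-style code executor."""
    return f"executed: {snippet}"

def web_search(query: str) -> str:
    """Stand-in for the online search tool (text APIs / image cache)."""
    return f"results for: {query}"

TOOLS = {"code": run_code, "search": web_search}

def agent_loop(model, task: str, max_steps: int = 8) -> str:
    """Alternate model decisions with tool results until a final answer."""
    context = [("task", task)]
    for _ in range(max_steps):
        action, payload = model(context)  # model picks a tool or answers
        if action == "answer":
            return payload
        # Append the tool's observation so the next step can condition on it.
        context.append((action, TOOLS[action](payload)))
    return "no answer within step budget"
```

In the actual system, reinforcement learning trains the model to decide when each tool is worth invoking, rather than following a fixed call order.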

Quick Start & Requirements

  • Installation: The cold-start (SFT) stage uses LLaMA-Factory; follow its documentation for environment setup. For reinforcement training: cd reinforcement_learning, pip install -e ., then run scripts/install_deepeyes.sh.
  • Prerequisites:
      • Models: Qwen-2.5-VL-7B/32B-Instruct as the foundation; Qwen-2.5-72B-Instruct as the judge.
      • Hardware: high-performance machines for code servers; training requires >= 32 GPUs (7B) or >= 64 GPUs (32B) and >= 1200 GB CPU RAM per node.
      • Software: Docker for the code sandbox, a Ray cluster for training, wandb for visualization.
      • Data: SFT data (https://huggingface.co/datasets/honglyhly/DeepEyesV2_SFT), RL data (https://huggingface.co/datasets/honglyhly/DeepEyesV2_RL), and MMSearch-R1 cache data.
  • Links: Evaluation details are provided in the repository.

Highlighted Details

  • Code Sandbox: Executes Jupyter-style code safely within Docker containers.
  • Online Search: Integrates text search via online services and image search using the MMSearch-R1 cache, bypassing RAG.
  • LLM-as-a-Judge: Employs Qwen models for verification tasks.
  • Reinforcement Learning: Facilitates complex tool chaining and adaptive tool invocation.
  • Curated Corpus: Training data is rigorously filtered and cleaned for SFT and RL.
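As a toy illustration of the Jupyter-style execution the code sandbox provides, the snippet below runs a code cell and captures its printed output. The real project isolates execution inside Docker containers; this in-process version has no such isolation and exists only to show the cell-execution idea.

```python
# Minimal illustration of Jupyter-style cell execution with captured
# stdout. NOT the project's sandbox: DeepEyesV2 runs cells inside
# Docker containers for isolation, which this toy version lacks.
import contextlib
import io

def execute_cell(source: str) -> str:
    """Run a code cell in a fresh namespace and return its printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # fresh globals per cell, like a one-cell kernel
    return buffer.getvalue()
```

For example, execute_cell("print(2 ** 10)") returns the string "1024\n". A production sandbox would add resource limits, network policy, and container-level isolation on top of this.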

Maintenance & Community

No explicit information on community channels (Discord, Slack), active contributors beyond authors, or project roadmap is provided in the README.

Licensing & Compatibility

Released under the Apache 2.0 license, which is generally permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

Training DeepEyesV2 is resource-intensive, demanding substantial GPU and CPU RAM. Deploying multiple code servers and high-performance machines is recommended for RL training to mitigate bandwidth saturation and network timeouts. Setup relies on external projects like LLaMA-Factory and VeRL.

Health Check
Last Commit

6 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
6
Star History
401 stars in the last 30 days
