DeepEyesV2
Visual-Agent: agentic multimodal model for integrated reasoning and tool use
Top 72.2% on SourcePulse
Summary
DeepEyesV2 is an agentic multimodal model that strengthens complex reasoning by integrating visual information into a unified loop of code execution and web search. It targets researchers and engineers seeking advanced AI agents for reliable, multi-step problem-solving, and offers significant improvements in tool use and task adaptation.
How It Works
The model unifies sandboxed, Jupyter-style code execution and web search (search APIs plus an image cache) within a single reasoning loop. It is built on Qwen-2.5-VL foundation models and trained on curated SFT and RL data; reinforcement learning teaches it to combine tools in sophisticated ways and to invoke them adaptively based on context, yielding strong reasoning and tool-use capabilities.
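To make the loop concrete, here is a minimal sketch of an interleaved generate/execute/search loop. Every name in it (AgentState, generate_step, run_code, web_search) is a hypothetical placeholder for illustration, not the DeepEyesV2 API.

    # Hypothetical sketch of a unified reasoning loop that interleaves model
    # generation with sandboxed code execution and web search. All names here
    # are illustrative placeholders, not the DeepEyesV2 API.
    from dataclasses import dataclass, field

    @dataclass
    class AgentState:
        """Accumulated context: user question, model steps, tool outputs."""
        context: list = field(default_factory=list)

    def generate_step(state: AgentState) -> dict:
        """Placeholder for one model decoding step; returns an action such as
        {"type": "code", ...}, {"type": "search", ...}, or {"type": "answer", ...}."""
        raise NotImplementedError

    def run_code(source: str) -> str:
        """Placeholder for a sandboxed (Jupyter-style) code executor."""
        raise NotImplementedError

    def web_search(query: str) -> str:
        """Placeholder for a web/image search API call."""
        raise NotImplementedError

    def agent_loop(question: str, max_turns: int = 8) -> str:
        state = AgentState(context=[{"role": "user", "content": question}])
        for _ in range(max_turns):
            action = generate_step(state)
            if action["type"] == "answer":
                return action["body"]  # the model decided it is done
            if action["type"] == "code":
                result = run_code(action["body"])
            else:  # "search"
                result = web_search(action["body"])
            # Tool output is appended to the context for the next step.
            state.context.append({"role": "tool", "content": result})
        return "max turns exceeded"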
Quick Start & Requirements
cd reinforcement_learning
pip install -e .
scripts/install_deepeyes.sh
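Because the model builds on Qwen-2.5-VL, a released checkpoint would plausibly load through the standard Hugging Face transformers classes for that family. The snippet below is a sketch under that assumption; the model ID is a placeholder, not a published checkpoint name.

    # Hedged sketch: loading a Qwen-2.5-VL-based checkpoint with Hugging Face
    # transformers (>= 4.49, plus accelerate for device_map="auto"). The model
    # ID below is a placeholder, not a published DeepEyesV2 checkpoint name.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "your-org/DeepEyesV2-checkpoint"  # placeholder
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)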
Maintenance & Community
The README provides no explicit information on community channels (Discord, Slack), active contributors beyond the authors, or a project roadmap.
Licensing & Compatibility
Released under the Apache 2.0 license, which is generally permissive for commercial use and integration into closed-source projects.
Limitations & Caveats
Training DeepEyesV2 is resource-intensive, demanding substantial GPU and CPU RAM. Deploying multiple code servers and high-performance machines is recommended for RL training to mitigate bandwidth saturation and network timeouts. Setup relies on external projects like LLaMA-Factory and VeRL.
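As a rough illustration of the multiple-code-server recommendation, a training client could rotate sandbox requests across several endpoints with per-request timeouts. The URLs, the /execute route, and the payload shape below are assumptions for illustration, not the project's actual interface.

    # Illustration only: round-robin dispatch of code-execution requests across
    # several sandbox servers, with per-request timeouts so a saturated or dead
    # server does not hang RL training. Endpoints, the /execute route, and the
    # JSON payload shape are all assumptions, not the project's interface.
    import itertools
    import requests

    SANDBOX_ENDPOINTS = [
        "http://10.0.0.1:8000",  # placeholder addresses
        "http://10.0.0.2:8000",
        "http://10.0.0.3:8000",
    ]
    _endpoint_cycle = itertools.cycle(SANDBOX_ENDPOINTS)

    def execute_remotely(source: str, timeout_s: float = 30.0) -> str:
        """Send one code snippet to the next sandbox server in the rotation."""
        endpoint = next(_endpoint_cycle)
        resp = requests.post(
            f"{endpoint}/execute", json={"code": source}, timeout=timeout_s
        )
        resp.raise_for_status()
        return resp.json()["output"]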