Agentic RL training framework
DeepEyes enables large language models to "think with images" by integrating visual information directly into reasoning chains, trained end-to-end via reinforcement learning. This approach targets researchers and developers building advanced multimodal AI agents, offering improved visual grounding, hallucination mitigation, and problem-solving capabilities without requiring supervised fine-tuning for intermediate steps.
How It Works
DeepEyes leverages a reinforcement learning framework built upon VeRL, allowing for asynchronous agent rollouts and dynamic multimodal inputs. It supports various RL algorithms like PPO and GRPO, with modifications for interleaved agentic RL training. The system is designed for efficient training on high-resolution benchmarks and demonstrates emergent thinking patterns such as visual search and tool-assisted verification.
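The interleaved agentic rollout described above can be sketched in miniature: the policy alternates between emitting text and requesting a tool call (such as an image crop), and each tool result re-enters the context as a new multimodal observation. This is an illustrative sketch only; the function and class names (generate_step, agentic_rollout) are hypothetical and do not reflect DeepEyes' or VeRL's actual API.

```python
# Hypothetical sketch of one interleaved agentic rollout: the model may
# request an image crop mid-generation, and the crop is fed back as a new
# observation. Names here are illustrative, not the DeepEyes/VeRL API.
from dataclasses import dataclass, field

@dataclass
class Rollout:
    messages: list = field(default_factory=list)  # interleaved text/image turns

def generate_step(messages):
    """Stand-in for one decoding step; returns either a tool call or text.

    A real policy would be a vision-language model; here two turns are
    scripted purely for illustration.
    """
    if not any(m.get("tool") for m in messages):
        return {"tool": "image_zoom", "args": {"bbox": [10, 10, 50, 50]}}
    return {"text": "<answer>cat</answer>"}

def run_tool(call, image):
    """Execute the requested crop; the result re-enters the context."""
    x0, y0, x1, y1 = call["args"]["bbox"]
    return {"image_crop": [row[x0:x1] for row in image[y0:y1]]}

def agentic_rollout(image, max_turns=4):
    r = Rollout(messages=[{"image": image}])
    for _ in range(max_turns):
        out = generate_step(r.messages)
        if "tool" in out:
            r.messages.append(out)
            r.messages.append(run_tool(out, image))  # observation appended
        else:
            r.messages.append(out)  # final textual answer ends the rollout
            break
    return r

rollout = agentic_rollout([[0] * 64 for _ in range(64)])
```

In an RL trainer, each such rollout (including the tool observations) would form one trajectory scored by an outcome reward, which is what lets visual-search behavior emerge without supervised intermediate steps.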
Quick Start & Requirements
Install the package with pip install -e . and run bash scripts/install_deepeyes.sh for additional dependencies.

Highlighted Details
Custom tools can be added by specifying the env_name field and implementing ToolBase subclasses.

Maintenance & Community
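A minimal sketch of what an env_name-keyed ToolBase subclass might look like. The ToolBase stand-in, the execute() signature, and the registry decorator are all assumptions for illustration; the actual DeepEyes/VeRL interface may differ.

```python
# Illustrative sketch: the ToolBase stand-in, execute() signature, and
# registry mechanism below are assumptions, not the real DeepEyes API.
class ToolBase:
    """Stand-in base class: subclasses implement execute()."""
    name: str = ""

    def execute(self, action: dict) -> dict:
        raise NotImplementedError

TOOL_REGISTRY = {}  # hypothetical registry keyed by env_name

def register_tool(env_name: str):
    def decorator(cls):
        TOOL_REGISTRY[env_name] = cls
        return cls
    return decorator

@register_tool("image_zoom")
class ImageZoomTool(ToolBase):
    """Crops a bounding box out of an image represented as nested lists."""
    name = "image_zoom"

    def execute(self, action: dict) -> dict:
        x0, y0, x1, y1 = action["bbox"]
        img = action["image"]
        return {"observation": [row[x0:x1] for row in img[y0:y1]]}

tool = TOOL_REGISTRY["image_zoom"]()
obs = tool.execute({"bbox": [0, 0, 2, 2],
                    "image": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
# obs["observation"] == [[1, 2], [4, 5]]
```

The registry-plus-subclass pattern keeps tool dispatch decoupled from the trainer: the rollout loop looks tools up by env_name at runtime, so new environments can be added without touching training code.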
The project is actively maintained and was most recently synced with the VeRL main branch on April 23, 2025. Future updates will land on the dev branch. The README does not link to community channels or a roadmap.
Licensing & Compatibility
This project is released under the Apache License. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
Training DeepEyes requires significant computational resources (multiple high-end GPUs and substantial RAM), putting it out of reach for users without such infrastructure. The project depends on specific foundation models and the VeRL framework, which may introduce compatibility issues with other systems.