DeepEyes  by Visual-Agent

Agentic RL training framework

created 4 months ago
682 stars

Top 50.7% on sourcepulse

GitHubView on GitHub
Project Summary

DeepEyes enables large language models to "think with images" by integrating visual information directly into reasoning chains, trained end-to-end via reinforcement learning. This approach targets researchers and developers building advanced multimodal AI agents, offering improved visual grounding, hallucination mitigation, and problem-solving capabilities without requiring supervised fine-tuning for intermediate steps.

How It Works

DeepEyes leverages a reinforcement learning framework built upon VeRL, allowing for asynchronous agent rollouts and dynamic multimodal inputs. It supports various RL algorithms like PPO and GRPO, with modifications for interleaved agentic RL training. The system is designed for efficient training on high-resolution benchmarks and demonstrates emergent thinking patterns such as visual search and tool-assisted verification.

Quick Start & Requirements

  • Install via pip install -e . and run bash scripts/install_deepeyes.sh for additional dependencies.
  • Requires Qwen-2.5-VL-7B-Instruct or Qwen-2.5-VL-32B-Instruct as the foundation model.
  • Training necessitates substantial hardware: minimum 32 GPUs (for 7B) or 64 GPUs (for 32B), and at least 1200GB CPU RAM per node for high-resolution datasets.
  • A vllm serving instance of Qwen-2.5-72B-Instruct is needed for LLM-as-a-judge verification.
  • Refer to the project homepage for detailed setup and training scripts.

Highlighted Details

  • End-to-end RL training guided by outcome rewards, no cold-start or supervised fine-tuning needed.
  • Achieves significant performance gains on high-resolution benchmarks and shows strong generalization.
  • Emergent thinking patterns observed, including visual search and tool-assisted answer verification.
  • Framework supports custom datasets and tools by adding an env_name field and implementing ToolBase subclasses.

Maintenance & Community

The project is actively maintained, with the last sync with the VeRL main branch on April 23, 2025. Future updates will be on the dev branch. Links to community channels or roadmaps are not explicitly provided in the README.

Licensing & Compatibility

This project is released under the Apache License. This license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

Training DeepEyes requires significant computational resources (multiple high-end GPUs and substantial RAM), making it inaccessible for users without access to such infrastructure. The project relies on specific foundation models and the VeRL framework, which may introduce dependencies and potential compatibility issues with other systems.

Health Check
Last commit

3 weeks ago

Responsiveness

Inactive

Pull Requests (30d)
2
Issues (30d)
18
Star History
694 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher Jeff Hammerbacher(Cofounder of Cloudera), and
10 more.

open-r1 by huggingface

0.2%
25k
SDK for reproducing DeepSeek-R1
created 6 months ago
updated 4 days ago
Feedback? Help us improve.