Agentic RL training framework
DeepEyes enables large language models to "think with images" by integrating visual information directly into reasoning chains, trained end-to-end via reinforcement learning. This approach targets researchers and developers building advanced multimodal AI agents, offering improved visual grounding, hallucination mitigation, and problem-solving capabilities without requiring supervised fine-tuning for intermediate steps.
How It Works
DeepEyes leverages a reinforcement learning framework built upon VeRL, allowing for asynchronous agent rollouts and dynamic multimodal inputs. It supports various RL algorithms like PPO and GRPO, with modifications for interleaved agentic RL training. The system is designed for efficient training on high-resolution benchmarks and demonstrates emergent thinking patterns such as visual search and tool-assisted verification.
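The interleaved agentic rollout described above can be sketched in miniature: the policy alternates between emitting text and requesting a tool call (such as an image crop), and each tool result re-enters the context as a new multimodal observation. This is an illustrative sketch only; the function and class names (generate_step, agentic_rollout) are hypothetical and do not reflect DeepEyes' or VeRL's actual API.

```python
# Hypothetical sketch of one interleaved agentic rollout: the model may
# request an image crop mid-generation, and the crop is fed back as a new
# observation. Names here are illustrative, not the DeepEyes/VeRL API.
from dataclasses import dataclass, field

@dataclass
class Rollout:
    messages: list = field(default_factory=list)  # interleaved text/image turns

def generate_step(messages):
    """Stand-in for one decoding step; returns either a tool call or text.

    A real policy would be a vision-language model; here two turns are
    scripted purely for illustration.
    """
    if not any(m.get("tool") for m in messages):
        return {"tool": "image_zoom", "args": {"bbox": [10, 10, 50, 50]}}
    return {"text": "<answer>cat</answer>"}

def run_tool(call, image):
    """Execute the requested crop; the result re-enters the context."""
    x0, y0, x1, y1 = call["args"]["bbox"]
    return {"image_crop": [row[x0:x1] for row in image[y0:y1]]}

def agentic_rollout(image, max_turns=4):
    r = Rollout(messages=[{"image": image}])
    for _ in range(max_turns):
        out = generate_step(r.messages)
        if "tool" in out:
            r.messages.append(out)
            r.messages.append(run_tool(out, image))  # observation appended
        else:
            r.messages.append(out)  # final textual answer ends the rollout
            break
    return r

rollout = agentic_rollout([[0] * 64 for _ in range(64)])
```

In an RL trainer, each such rollout (including the tool observations) would form one trajectory scored by an outcome reward, which is what lets visual-search behavior emerge without supervised intermediate steps.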
Quick Start & Requirements
Install the package with pip install -e . and run bash scripts/install_deepeyes.sh for additional dependencies.

Highlighted Details
Custom tools can be added by specifying the env_name field and implementing ToolBase subclasses.

Maintenance & Community
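A minimal sketch of what an env_name-keyed ToolBase subclass might look like. The ToolBase stand-in, the execute() signature, and the registry decorator are all assumptions for illustration; the actual DeepEyes/VeRL interface may differ.

```python
# Illustrative sketch: the ToolBase stand-in, execute() signature, and
# registry mechanism below are assumptions, not the real DeepEyes API.
class ToolBase:
    """Stand-in base class: subclasses implement execute()."""
    name: str = ""

    def execute(self, action: dict) -> dict:
        raise NotImplementedError

TOOL_REGISTRY = {}  # hypothetical registry keyed by env_name

def register_tool(env_name: str):
    def decorator(cls):
        TOOL_REGISTRY[env_name] = cls
        return cls
    return decorator

@register_tool("image_zoom")
class ImageZoomTool(ToolBase):
    """Crops a bounding box out of an image represented as nested lists."""
    name = "image_zoom"

    def execute(self, action: dict) -> dict:
        x0, y0, x1, y1 = action["bbox"]
        img = action["image"]
        return {"observation": [row[x0:x1] for row in img[y0:y1]]}

tool = TOOL_REGISTRY["image_zoom"]()
obs = tool.execute({"bbox": [0, 0, 2, 2],
                    "image": [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
# obs["observation"] == [[1, 2], [4, 5]]
```

The registry-plus-subclass pattern keeps tool dispatch decoupled from the trainer: the rollout loop looks tools up by env_name at runtime, so new environments can be added without touching training code.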
The project is actively maintained and was most recently synced with the VeRL main branch on April 23, 2025. Future updates will land on the dev branch. The README does not link to community channels or a roadmap.
Licensing & Compatibility
This project is released under the Apache License. This license is permissive and generally compatible with commercial use and closed-source linking.
Limitations & Caveats
Training DeepEyes requires significant computational resources (multiple high-end GPUs and substantial RAM), putting it out of reach for users without such infrastructure. The project depends on specific foundation models and the VeRL framework, which may introduce compatibility issues with other systems.