RL for LLMs in verifiable environments
This repository provides tools for reinforcement learning (RL) with large language models (LLMs) in verifiable environments, specifically targeting multi-turn tool use. It is designed for researchers and practitioners working on advanced LLM-based agents that require complex interaction and validation.
How It Works
The core approach uses Group Relative Policy Optimization (GRPO) for RL training within custom multi-turn environments. It supports multi-agent interactions and features specialized environments such as `ToolEnv` and `CodeEnv`, with XML parsers for dataset formatting and rubrics for evaluating correctness. This design facilitates training LLMs to use tools reliably and complete complex, verifiable tasks.
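For reference, GRPO replaces a learned value baseline with group-relative reward normalization: several rollouts are sampled per prompt, and each rollout's advantage is its reward standardized against its own group. A minimal standalone sketch of that step (illustrative only, with made-up reward values; not this repository's code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each rollout's reward against its own group.

    rewards: (num_prompts, group_size) scalar rewards, e.g. rubric scores
    from a verifiable environment. Returns per-rollout advantages.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Four rollouts per prompt; a verified-correct rollout scores 1.0, a failure 0.0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```

Because advantages are relative within each group, a prompt where every rollout fails (or every rollout succeeds) contributes no gradient signal, which is why verifiable, discriminating rewards matter here.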
Quick Start & Requirements
- Setup: `git clone` the repository, run `uv sync`, then `uv pip install flash-attn --no-build-isolation`; activate the virtual environment with `source .venv/bin/activate`.
- Requires `wandb` and `huggingface-cli` logins (or set `report_to=None`).
- Use `accelerate launch` for training.
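`accelerate launch` runs an ordinary Python training script. The sketch below shows how the pieces described above might fit together; the import path, constructor arguments, and the `GRPOTrainer` name are assumptions for illustration, not the repository's documented API:

```python
# train.py -- launched with: accelerate launch train.py
# Hypothetical wiring: the import path, `ToolEnv` arguments, and `GRPOTrainer`
# are assumed names for illustration, not this repository's confirmed API.
from verifiers_lib import GRPOTrainer, ToolEnv  # placeholder import path


def calculator(expression: str) -> str:
    """Toy tool exposed to the model during rollouts (illustration only)."""
    return str(eval(expression))  # never eval untrusted input in real code


env = ToolEnv(tools=[calculator])        # assumed constructor signature
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder model id
    env=env,
    report_to=None,                      # skip wandb logging, as noted above
)
trainer.train()
```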
Highlighted Details
- Specialized environments for code execution and tool calling: `CodeEnv` and `ToolEnv`.
- Ships with `DoubleCheckEnv`, `CodeEnv`, and `ToolEnv` example environments.
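The rubric idea pairs format checking with correctness checking: a completion is parsed into structured XML fields, and each check contributes to a verifiable scalar reward. A standalone sketch under assumed tags and weights (the parser and scoring below are illustrative, not the repository's actual parser or rubric code):

```python
import re

# Hypothetical completion format: reasoning plus a final answer in XML tags.
completion = "<reasoning>7 * 6 = 42</reasoning><answer>42</answer>"

def parse_answer(text: str) -> str | None:
    """Extract the <answer> field; return None if the format is violated."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def rubric_reward(completion: str, target: str) -> float:
    """Two-part rubric (assumed weights): format adherence plus exact match."""
    answer = parse_answer(completion)
    format_score = 0.2 if answer is not None else 0.0
    correct_score = 0.8 if answer == target else 0.0
    return format_score + correct_score

print(rubric_reward(completion, target="42"))  # 1.0
```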
Maintenance & Community
The project is presented as in-progress research code. No specific community channels or maintenance details are provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license. The citation suggests it is intended for research purposes. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
This repository is in-progress research code and is not guaranteed to yield stable or optimal training results. It targets multi-turn LLM RL specifically, so it may be a poor fit for projects that do not need multi-turn tool calling or multi-agent interaction.