OpenClaw-RL by Gen-Verse

Personalize AI agents through conversational reinforcement learning

Created 1 week ago

1,307 stars

Top 30.4% on SourcePulse

View on GitHub
Project Summary

OpenClaw-RL personalizes self-hosted AI agents through continuous reinforcement learning on natural-conversation feedback. It targets engineers and researchers who want to improve LLM agents without interrupting live usage, offering a privacy-preserving, asynchronous framework that turns dialogue into actionable training signals so agent performance improves over time.

How It Works

The framework employs a fully asynchronous, 4-component architecture (serving, rollout collection, PRM judging, policy training) to decouple processes, allowing continuous background optimization without blocking user interactions. It automatically converts multi-turn conversations into training signals by classifying turns and using subsequent messages as state feedback. Two distinct learning paradigms are supported: Binary RL (GRPO) leverages a Process Reward Model (PRM) for scalar rewards, while On-Policy Distillation (OPD) uses hindsight-derived textual hints to guide policy updates via an "enhanced teacher" model, offering richer directional learning.
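The turn-classification idea above can be sketched in a few lines. This is a hypothetical illustration, not OpenClaw-RL's actual API: `conversation_to_rollouts` and `toy_prm_reward` are invented names, and the keyword-based "PRM" is a stand-in for a real learned reward model.

```python
# Hypothetical sketch: pair each assistant turn with the user's next
# message (which serves as implicit state feedback on that turn), then
# score it with a stand-in for the Process Reward Model.

def conversation_to_rollouts(turns):
    """Turn a multi-turn chat into (state, action, feedback) triples."""
    rollouts = []
    for i, turn in enumerate(turns):
        if turn["role"] != "assistant":
            continue
        state = turns[:i]         # conversation history up to this reply
        action = turn["content"]  # the agent's reply
        # The user's subsequent message acts as feedback on the reply.
        feedback = next(
            (t["content"] for t in turns[i + 1:] if t["role"] == "user"),
            None,
        )
        if feedback is not None:
            rollouts.append({"state": state, "action": action, "feedback": feedback})
    return rollouts

def toy_prm_reward(rollout):
    """Stand-in for the PRM: maps feedback to a scalar reward in {0, 1}."""
    negative_cues = ("wrong", "that's not", "try again")
    fb = rollout["feedback"].lower()
    return 0.0 if any(cue in fb for cue in negative_cues) else 1.0

turns = [
    {"role": "user", "content": "Summarize this log file."},
    {"role": "assistant", "content": "Here is a three-line summary ..."},
    {"role": "user", "content": "Wrong file, try again with app.log."},
]
rollouts = conversation_to_rollouts(turns)
print(len(rollouts), toy_prm_reward(rollouts[0]))  # 1 0.0
```

In the real system this conversion happens asynchronously in the rollout-collection component, so the serving path never blocks on it.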

Quick Start & Requirements

Setup requires a robust environment: 8x GPUs (configurable via environment variables like NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS) running CUDA 12.9 and Python 3.12, with the Slime RL framework as a prerequisite. Core execution involves navigating to the slime directory and running specific bash scripts like ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh for Binary RL or ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh for OPD. The system exposes an OpenAI-compatible API endpoint at http://<HOST_IP>:30000/v1 for integration with OpenClaw. Detailed environment setup instructions are available in ./instructions/README.md.
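Since the endpoint is OpenAI-compatible, any standard chat-completions client should work against it. A minimal stdlib sketch of constructing such a request is below; the model name and the `build_chat_request` helper are illustrative assumptions, not part of OpenClaw-RL.

```python
# Minimal sketch of targeting the OpenAI-compatible endpoint the framework
# exposes at http://<HOST_IP>:30000/v1. Builds the request without sending
# it, since a running server is required for an actual call.
import json
import urllib.request

def build_chat_request(host_ip, messages, model="qwen3-4b"):
    """Build (but do not send) a /v1/chat/completions request."""
    url = f"http://{host_ip}:30000/v1/chat/completions"
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "127.0.0.1",
    [{"role": "user", "content": "Hello from OpenClaw"}],
)
print(req.full_url)  # http://127.0.0.1:30000/v1/chat/completions
# urllib.request.urlopen(req) would dispatch it once the server is up.
```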

Highlighted Details

  • Fully asynchronous, non-blocking 4-component architecture.
  • Self-hosted and privacy-focused: entire stack runs on user infrastructure, with no data exfiltration.
  • Automatic training-signal generation from live conversations, eliminating manual labeling.
  • Dual learning paradigms: GRPO for scalar rewards and OPD for textual distillation.
  • Production-ready features include session-aware training, graceful weight updates, and robust hint filtering for OPD.

Maintenance & Community

The project roadmap indicates ongoing development with planned enhancements for broader model support and scalable infrastructure. No specific community channels (e.g., Discord, Slack) or notable contributors are detailed in the provided README.

Licensing & Compatibility

License information is not specified in the provided README; confirm licensing with the maintainers before commercial use or integration into closed-source projects.

Limitations & Caveats

The system has significant hardware demands, defaulting to 8 GPUs, and requires specific software versions (CUDA 12.9, Python 3.12). Its reliance on the Slime framework and the lack of explicit licensing information present potential adoption blockers. The project appears to be in active development, as indicated by its roadmap.

Health Check
Last Commit

10 hours ago

Responsiveness

Inactive

Pull Requests (30d)
8
Issues (30d)
1
Star History
1,422 stars in the last 13 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

agents by aiwaves-cn

0.1%
6k
Open-source framework for self-evolving, data-centric autonomous language agents
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 12 more.

rasa by RasaHQ

0.0%
21k
AI framework for automating text and voice conversations
Created 9 years ago
Updated 1 month ago