OpenClaw-RL by Gen-Verse

Personalize AI agents through conversational reinforcement learning

Created 1 week ago

1,307 stars

Top 30.4% on SourcePulse

View on GitHub
Project Summary

OpenClaw-RL personalizes self-hosted AI agents through continuous reinforcement learning on natural-conversation feedback. It targets engineers and researchers who want to improve LLM agents without interrupting live usage, offering a privacy-preserving, asynchronous framework that turns dialogue into actionable training signals so agent performance improves over time.

How It Works

The framework employs a fully asynchronous, 4-component architecture (serving, rollout collection, PRM judging, policy training) to decouple processes, allowing continuous background optimization without blocking user interactions. It automatically converts multi-turn conversations into training signals by classifying turns and using subsequent messages as state feedback. Two distinct learning paradigms are supported: Binary RL (GRPO) leverages a Process Reward Model (PRM) for scalar rewards, while On-Policy Distillation (OPD) uses hindsight-derived textual hints to guide policy updates via an "enhanced teacher" model, offering richer directional learning.
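The turn-classification idea above can be sketched in a few lines. This is a hypothetical illustration, not OpenClaw-RL's actual API: `conversation_to_rollouts` and `toy_prm_reward` are invented names, and the keyword-based "PRM" is a stand-in for a real learned reward model.

```python
# Hypothetical sketch: pair each assistant turn with the user's next
# message (which serves as implicit state feedback on that turn), then
# score it with a stand-in for the Process Reward Model.

def conversation_to_rollouts(turns):
    """Turn a multi-turn chat into (state, action, feedback) triples."""
    rollouts = []
    for i, turn in enumerate(turns):
        if turn["role"] != "assistant":
            continue
        state = turns[:i]         # conversation history up to this reply
        action = turn["content"]  # the agent's reply
        # The user's subsequent message acts as feedback on the reply.
        feedback = next(
            (t["content"] for t in turns[i + 1:] if t["role"] == "user"),
            None,
        )
        if feedback is not None:
            rollouts.append({"state": state, "action": action, "feedback": feedback})
    return rollouts

def toy_prm_reward(rollout):
    """Stand-in for the PRM: maps feedback to a scalar reward in {0, 1}."""
    negative_cues = ("wrong", "that's not", "try again")
    fb = rollout["feedback"].lower()
    return 0.0 if any(cue in fb for cue in negative_cues) else 1.0

turns = [
    {"role": "user", "content": "Summarize this log file."},
    {"role": "assistant", "content": "Here is a three-line summary ..."},
    {"role": "user", "content": "Wrong file, try again with app.log."},
]
rollouts = conversation_to_rollouts(turns)
print(len(rollouts), toy_prm_reward(rollouts[0]))  # 1 0.0
```

In the real system this conversion happens asynchronously in the rollout-collection component, so the serving path never blocks on it.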

Quick Start & Requirements

Setup requires a robust environment: 8x GPUs (configurable via environment variables like NUM_GPUS, ACTOR_GPUS, ROLLOUT_GPUS, PRM_GPUS) running CUDA 12.9 and Python 3.12, with the Slime RL framework as a prerequisite. Core execution involves navigating to the slime directory and running specific bash scripts like ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh for Binary RL or ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh for OPD. The system exposes an OpenAI-compatible API endpoint at http://<HOST_IP>:30000/v1 for integration with OpenClaw. Detailed environment setup instructions are available in ./instructions/README.md.
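Since the endpoint is OpenAI-compatible, any standard chat-completions client should work against it. A minimal stdlib sketch of constructing such a request is below; the model name and the `build_chat_request` helper are illustrative assumptions, not part of OpenClaw-RL.

```python
# Minimal sketch of targeting the OpenAI-compatible endpoint the framework
# exposes at http://<HOST_IP>:30000/v1. Builds the request without sending
# it, since a running server is required for an actual call.
import json
import urllib.request

def build_chat_request(host_ip, messages, model="qwen3-4b"):
    """Build (but do not send) a /v1/chat/completions request."""
    url = f"http://{host_ip}:30000/v1/chat/completions"
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "127.0.0.1",
    [{"role": "user", "content": "Hello from OpenClaw"}],
)
print(req.full_url)  # http://127.0.0.1:30000/v1/chat/completions
# urllib.request.urlopen(req) would dispatch it once the server is up.
```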

Highlighted Details

  • Fully asynchronous, non-blocking 4-component architecture.
  • Self-hosted and privacy-focused: entire stack runs on user infrastructure, with no data exfiltration.
  • Automatic training-signal generation from live conversations, eliminating manual labeling.
  • Dual learning paradigms: GRPO for scalar rewards and OPD for textual distillation.
  • Production-ready features include session-aware training, graceful weight updates, and robust hint filtering for OPD.

Maintenance & Community

The project roadmap indicates ongoing development with planned enhancements for broader model support and scalable infrastructure. No specific community channels (e.g., Discord, Slack) or notable contributors are detailed in the provided README.

Licensing & Compatibility

License information is not specified in the provided README; confirm licensing with the maintainers before commercial use or integration into closed-source projects.

Limitations & Caveats

The system has significant hardware demands, defaulting to 8 GPUs, and requires specific software versions (CUDA 12.9, Python 3.12). Its reliance on the Slime framework and the lack of explicit licensing information present potential adoption blockers. The project appears to be in active development, as indicated by its roadmap.

Health Check
Last Commit

10 hours ago

Responsiveness

Inactive

Pull Requests (30d)
8
Issues (30d)
1
Star History
1,422 stars in the last 13 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

agents by aiwaves-cn

0.1%
6k
Open-source framework for self-evolving, data-centric autonomous language agents
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Vincent Weisser (Cofounder of Prime Intellect), and 12 more.

rasa by RasaHQ

0.0%
21k
AI framework for automating text and voice conversations
Created 9 years ago
Updated 1 month ago