anakin87: LLM learns event scheduling via RL
This repository presents an experiment in training a Large Language Model (LLM) using Group Relative Policy Optimization (GRPO) to generate event schedules based on priorities. It targets researchers and practitioners interested in applying Reinforcement Learning (RL) to LLMs without relying on supervised examples, offering a novel approach to a problem typically solved with deterministic programming. The project demonstrates the potential for LLMs to learn complex reasoning and optimization tasks through reward-driven learning.
How It Works
The core methodology involves training a Qwen LLM with GRPO, a reinforcement learning technique that optimizes a policy from rewards rather than explicit target outputs: for each prompt, GRPO samples a group of completions and uses each completion's reward relative to the group average as its advantage, removing the need for a separate value model. This contrasts with standard supervised fine-tuning. The model is prompted with event lists and priorities, and its generated schedules are scored by custom reward functions designed to maximize the total priority-weighted duration of the selected events. This reward-driven approach lets the LLM discover scheduling strategies on its own rather than imitating worked examples.
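The repository's actual reward code lives in its training scripts; as an illustration only, here is a minimal sketch of a reward function in the style accepted by TRL's GRPOTrainer. The output format it parses ("09:00-10:30 [2] Keynote"), the hard-zero overlap penalty, and the weighting are assumptions for this sketch, not the repository's documented design.

```python
import re

def schedule_reward(completions: list[str], **kwargs) -> list[float]:
    """Score each completion: total priority-weighted duration of the
    parsed events, zeroed out if any two events overlap.
    Assumes lines like "09:00-10:30 [2] Keynote" (an assumed format)."""
    line_re = re.compile(r"(\d{2}):(\d{2})-(\d{2}):(\d{2})\s+\[(\d+)\]")
    rewards = []
    for text in completions:
        events = []
        for m in line_re.finditer(text):
            h1, m1, h2, m2, prio = map(int, m.groups())
            start, end = h1 * 60 + m1, h2 * 60 + m2
            if end > start:  # discard malformed or zero-length events
                events.append((start, end, prio))
        if not events:
            rewards.append(0.0)
            continue
        events.sort()  # sort by start time, then check adjacent pairs
        overlap = any(a[1] > b[0] for a, b in zip(events, events[1:]))
        weighted = sum((end - start) * prio for start, end, prio in events)
        rewards.append(0.0 if overlap else float(weighted))
    return rewards
```

In TRL, a list of such callables can be passed to GRPOTrainer via its reward_funcs argument, with each returning one score per sampled completion in the group.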
Quick Start & Requirements
While direct installation commands are not provided, the repository includes the essential components for replication: scripts for dataset generation, GRPO training notebooks, prompt templates, and evaluation tools. Key resources are a blog post detailing the experiment, the generated Events Scheduling dataset, and the trained model, published as anakin87/qwen-scheduler-7b-grpo. Users will need a Python environment with LLM training dependencies, likely including GPU acceleration, though specific hardware requirements are not stated.
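Absent official install instructions, a plausible starting point is pulling the published checkpoint from the Hugging Face Hub with transformers. The model ID comes from the summary above; the prompt wording and generation settings below are illustrative assumptions, not the repository's documented interface.

```python
# Illustrative only: load the released checkpoint and ask for a schedule.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "anakin87/qwen-scheduler-7b-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed prompt shape: event list with priorities, then the instruction.
prompt = (
    "Events (with priority):\n"
    "09:00-10:30 [2] Keynote\n"
    "10:00-11:00 [1] Workshop\n"
    "11:30-12:00 [3] Demo\n"
    "Produce a non-overlapping schedule maximizing weighted duration."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```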
Highlighted Details
The GRPO-trained 7B model showed significant gains over both its base model and a larger 14B model on the event scheduling task. It learned to adhere to the specified output format and the general scheduling rules. However, it still struggles to consistently avoid overlapping events, pointing to room for refinement in the reward function.
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap are present in the provided README; the listing marks the repository as inactive, last updated roughly 8 months ago.
Licensing & Compatibility
The README does not specify a software license. This absence creates ambiguity regarding usage rights, commercial application, and derivative works.
Limitations & Caveats
The primary limitation is the model's difficulty in reliably avoiding overlapping events, suggesting the reward function needs further tuning for this constraint; a post-hoc validity check, sketched below, can flag such violations at inference time. The project is an experimental demonstration rather than a production-ready tool.
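Until the reward design closes that gap, a cheap guard is to validate generated schedules before accepting them. This sketch assumes events have already been parsed into (start, end) minute offsets; it is not code from the repository.

```python
def has_overlap(events: list[tuple[int, int]]) -> bool:
    """Return True if any two (start, end) intervals overlap.
    Sort by start time, then compare each event with its successor."""
    ordered = sorted(events)
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

# Example: 9:00-10:30 overlaps 10:00-11:00, so this prints True.
print(has_overlap([(540, 630), (600, 660), (690, 720)]))
```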