qwen-scheduler-grpo by anakin87

LLM learns event scheduling via RL

Created 9 months ago
255 stars

Top 98.8% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository presents an experiment in training a Large Language Model (LLM) using Group Relative Policy Optimization (GRPO) to generate event schedules based on priorities. It targets researchers and practitioners interested in applying Reinforcement Learning (RL) to LLMs without relying on supervised examples, offering a novel approach to a problem typically solved with deterministic programming. The project demonstrates the potential for LLMs to learn complex reasoning and optimization tasks through reward-driven learning.

How It Works

The core methodology involves training a Qwen LLM with GRPO, a reinforcement learning technique that optimizes policies based on rewards rather than explicit target outputs. This contrasts with standard supervised fine-tuning. The model is prompted with event lists and priorities, and its generated schedules are evaluated against custom reward functions designed to maximize the weighted duration of selected events. This RL approach allows the LLM to discover scheduling strategies organically, pushing the boundaries of LLM learning paradigms.
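
To make the reward-driven setup concrete, here is a minimal, hypothetical sketch of such a reward function. It assumes each event is a (name, start_hour, end_hour, priority) tuple; the repository's actual reward functions, which also score output-format adherence, likely differ in detail.

    # Hypothetical reward sketch: priority-weighted total duration,
    # zeroed out if any two scheduled events overlap. The event format
    # (name, start_hour, end_hour, priority) is an assumption.

    def schedule_reward(schedule):
        """Return the priority-weighted duration of a schedule,
        or 0.0 if the schedule contains overlapping events."""
        events = sorted(schedule, key=lambda e: e[1])  # sort by start time
        for prev, curr in zip(events, events[1:]):
            if curr[1] < prev[2]:  # next event starts before previous ends
                return 0.0
        return sum((end - start) * priority
                   for _name, start, end, priority in events)

    events = [("hackathon", 9, 12, 2), ("talk", 11, 13, 1)]
    print(schedule_reward(events[:1]))  # 6.0 -> (12 - 9) * 2
    print(schedule_reward(events))      # 0.0 -> the two events overlap

In GRPO, a reward like this is computed for every completion in a sampled group, and the policy is then nudged toward the group's above-average schedules.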

Quick Start & Requirements

While direct installation commands are not provided, the repository includes the essential components for replication: scripts for dataset generation, GRPO training notebooks, prompt templates, and evaluation tools. Key resources are a blog post detailing the experiment, the generated Events Scheduling dataset, and the saved anakin87/qwen-scheduler-7b-grpo model. Users will need a Python environment with LLM training dependencies, likely including GPU acceleration, though specific hardware requirements are not detailed.
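
Since no install commands are given, here is a minimal sketch of loading the saved model for inference with Hugging Face transformers. It assumes a GPU plus the transformers and accelerate packages, and the prompt is illustrative only; consult the repository's prompt templates for the format the model was actually trained on.

    # Minimal inference sketch; assumes `pip install transformers accelerate`
    # and a GPU. The prompt below is illustrative, not the repo's template.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "anakin87/qwen-scheduler-7b-grpo"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    prompt = "Events with priorities: ... Produce a non-overlapping schedule."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))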

Highlighted Details

The GRPO-trained 7B model demonstrated significant performance gains over both its base model and a larger 14B model on the event scheduling task. It reliably follows the specified output format and general scheduling rules. However, it still struggles to consistently avoid overlapping events, pointing to the reward function as an area for refinement.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap are present in the provided README.

Licensing & Compatibility

The README does not specify a software license. This absence creates ambiguity regarding usage rights, commercial application, and derivative works.

Limitations & Caveats

The primary limitation identified is the model's difficulty avoiding overlapping events, which suggests the reward function needs further tuning for this constraint. The project is presented as an experimental demonstration rather than a production-ready tool.
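
One plausible refinement, sketched below as an assumption rather than anything taken from the repository, is to replace a hard zero-reward for overlapping schedules with a graded penalty proportional to the overlapping time, giving the policy a smoother learning signal.

    # Hypothetical refinement (not from the repository): penalize overlap
    # in proportion to hours of double-booking instead of zeroing the
    # reward. Events are (name, start_hour, end_hour, priority) tuples.

    def overlap_penalty(schedule, weight=2.0):
        """Approximate total overlapping hours, scaled by `weight`.

        Compares only adjacent pairs after sorting by start time, so
        chains of three or more overlapping events are undercounted;
        a full pairwise scan would be exact.
        """
        events = sorted(schedule, key=lambda e: e[1])
        overlap = 0.0
        for prev, curr in zip(events, events[1:]):
            overlap += max(0.0, min(prev[2], curr[2]) - curr[1])
        return weight * overlap

The scaled penalty would then be subtracted from the weighted-duration reward rather than replacing it with zero, so partially valid schedules still receive a useful learning signal.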

Health Check

Last Commit: 8 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 3 more.

ROLL by alibaba

RL library for large language models
Top 2.3% on SourcePulse
3k stars
Created 7 months ago
Updated 16 hours ago