anakin87: LLM learns event scheduling via RL
This repository presents an experiment in training a Large Language Model (LLM) using Group Relative Policy Optimization (GRPO) to generate event schedules based on priorities. It targets researchers and practitioners interested in applying Reinforcement Learning (RL) to LLMs without relying on supervised examples, offering a novel approach to a problem typically solved with deterministic programming. The project demonstrates the potential for LLMs to learn complex reasoning and optimization tasks through reward-driven learning.
How It Works
The core methodology involves training a Qwen LLM with GRPO, a reinforcement learning technique that optimizes a policy from rewards rather than explicit target outputs: for each prompt, GRPO samples a group of completions and uses each completion's reward relative to the group average as its advantage, removing the need for a separate value model. This contrasts with standard supervised fine-tuning. The model is prompted with event lists and priorities, and its generated schedules are scored by custom reward functions designed to maximize the total priority-weighted duration of the selected events. This reward-driven approach lets the LLM discover scheduling strategies on its own rather than imitating worked examples.
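The repository's actual reward code lives in its training scripts; as an illustration only, here is a minimal sketch of a reward function in the style accepted by TRL's GRPOTrainer. The output format it parses ("09:00-10:30 [2] Keynote"), the hard-zero overlap penalty, and the weighting are assumptions for this sketch, not the repository's documented design.

```python
import re

def schedule_reward(completions: list[str], **kwargs) -> list[float]:
    """Score each completion: total priority-weighted duration of the
    parsed events, zeroed out if any two events overlap.
    Assumes lines like "09:00-10:30 [2] Keynote" (an assumed format)."""
    line_re = re.compile(r"(\d{2}):(\d{2})-(\d{2}):(\d{2})\s+\[(\d+)\]")
    rewards = []
    for text in completions:
        events = []
        for m in line_re.finditer(text):
            h1, m1, h2, m2, prio = map(int, m.groups())
            start, end = h1 * 60 + m1, h2 * 60 + m2
            if end > start:  # discard malformed or zero-length events
                events.append((start, end, prio))
        if not events:
            rewards.append(0.0)
            continue
        events.sort()  # sort by start time, then check adjacent pairs
        overlap = any(a[1] > b[0] for a, b in zip(events, events[1:]))
        weighted = sum((end - start) * prio for start, end, prio in events)
        rewards.append(0.0 if overlap else float(weighted))
    return rewards
```

In TRL, a list of such callables can be passed to GRPOTrainer via its reward_funcs argument, with each returning one score per sampled completion in the group.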
Quick Start & Requirements
While direct installation commands are not provided, the repository includes the essential components for replication: scripts for dataset generation, GRPO training notebooks, prompt templates, and evaluation tools. Key resources are a blog post detailing the experiment, the generated Events Scheduling dataset, and the trained model, published as anakin87/qwen-scheduler-7b-grpo. Users will need a Python environment with LLM training dependencies, likely including GPU acceleration, though specific hardware requirements are not stated.
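Absent official install instructions, a plausible starting point is pulling the published checkpoint from the Hugging Face Hub with transformers. The model ID comes from the summary above; the prompt wording and generation settings below are illustrative assumptions, not the repository's documented interface.

```python
# Illustrative only: load the released checkpoint and ask for a schedule.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "anakin87/qwen-scheduler-7b-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Assumed prompt shape: event list with priorities, then the instruction.
prompt = (
    "Events (with priority):\n"
    "09:00-10:30 [2] Keynote\n"
    "10:00-11:00 [1] Workshop\n"
    "11:30-12:00 [3] Demo\n"
    "Produce a non-overlapping schedule maximizing weighted duration."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```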
Highlighted Details
The GRPO-trained 7B model showed significant gains over both its base model and a larger 14B model on the event scheduling task. It learned to adhere to the specified output format and the general scheduling rules. However, it still struggles to consistently avoid overlapping events, pointing to room for refinement in the reward function.
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord, Slack), sponsorships, or a public roadmap are present in the provided README; the listing marks the repository as inactive, last updated roughly 8 months ago.
Licensing & Compatibility
The README does not specify a software license. This absence creates ambiguity regarding usage rights, commercial application, and derivative works.
Limitations & Caveats
The primary limitation is the model's difficulty in reliably avoiding overlapping events, suggesting the reward function needs further tuning for this constraint; a post-hoc validity check, sketched below, can flag such violations at inference time. The project is an experimental demonstration rather than a production-ready tool.
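Until the reward design closes that gap, a cheap guard is to validate generated schedules before accepting them. This sketch assumes events have already been parsed into (start, end) minute offsets; it is not code from the repository.

```python
def has_overlap(events: list[tuple[int, int]]) -> bool:
    """Return True if any two (start, end) intervals overlap.
    Sort by start time, then compare each event with its successor."""
    ordered = sorted(events)
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

# Example: 9:00-10:30 overlaps 10:00-11:00, so this prints True.
print(has_overlap([(540, 630), (600, 660), (690, 720)]))
```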