LlamaGym by KhoomeiK

SDK for fine-tuning LLM agents with online reinforcement learning

created 1 year ago
1,208 stars

Top 33.1% on sourcepulse

Project Summary

LlamaGym simplifies the process of fine-tuning Large Language Model (LLM) agents using online reinforcement learning (RL) within Gymnasium-compatible environments. It targets researchers and developers who want LLMs to learn and adapt in real time through interaction, abstracting away the complexities of managing conversational context, reward signals, and RL algorithms like PPO.

How It Works

LlamaGym provides an Agent abstract class that encapsulates the core logic for integrating LLMs with RL. It handles the boilerplate code for managing LLM conversation history, batching episodes, assigning rewards, and setting up RL training loops. This approach allows users to focus on defining agent behavior through methods like get_system_prompt, format_observation, and extract_action, making it easier to experiment with prompting and hyperparameters across various RL environments.
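
For illustration, a minimal subclass might look like the sketch below. The three overridden methods are exactly those named above; the Blackjack-specific prompt text and action parsing are illustrative assumptions, not copied from the repository.

```python
import re

from llamagym import Agent


class BlackjackAgent(Agent):
    def get_system_prompt(self) -> str:
        # Tell the LLM what game it is playing and how to format its action.
        return (
            "You are playing Blackjack. Each turn, reply with 'Action: 0' to stay "
            "or 'Action: 1' to hit."
        )

    def format_observation(self, observation) -> str:
        # Gymnasium's Blackjack observation is (player_sum, dealer_card, usable_ace).
        player_sum, dealer_card, usable_ace = observation
        return (
            f"Your hand totals {player_sum}, the dealer shows {dealer_card}, "
            f"and you {'have' if usable_ace else 'do not have'} a usable ace."
        )

    def extract_action(self, response: str) -> int:
        # Parse the LLM's free-text reply back into a discrete Gymnasium action.
        match = re.search(r"Action: (\d)", response)
        return int(match.group(1)) if match else 0
```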

Quick Start & Requirements

  • Install via pip: pip install llamagym
  • Requires a pre-trained LLM (e.g., Llama-2-7b) and its corresponding tokenizer.
  • Needs a Gymnasium environment (e.g., gym.make("Blackjack-v1")).
  • A CUDA-capable GPU is recommended for efficient LLM fine-tuning.
  • Full example: examples/blackjack.py (a condensed sketch of the loop follows below)
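
As a rough orientation, end-to-end usage amounts to building the agent from a Hugging Face model/tokenizer pair and stepping it through a Gymnasium episode loop. The sketch below follows that pattern; the model checkpoint id, the Agent constructor arguments, and the act / assign_reward / terminate_episode calls mirror the repository's Blackjack example and should be confirmed against examples/blackjack.py.

```python
import gymnasium as gym
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

device = "cuda"  # a GPU is strongly recommended for fine-tuning

# Illustrative checkpoint; any causal LM with a value head wrapper should work.
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

agent = BlackjackAgent(model, tokenizer, device)  # subclass sketched above
env = gym.make("Blackjack-v1")

for episode in range(1000):
    observation, info = env.reset()
    done = False
    while not done:
        action = agent.act(observation)          # prompt the LLM and parse its action
        observation, reward, terminated, truncated, info = env.step(action)
        agent.assign_reward(reward)              # credit the reward to the current turn
        done = terminated or truncated
    train_stats = agent.terminate_episode()      # triggers a PPO update once a batch is full
```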

Highlighted Details

  • Simplifies online RL fine-tuning for LLM agents.
  • Abstracts LLM context management, reward assignment, and PPO setup.
  • Enables rapid iteration on agent prompting and hyperparameters.
  • Integrates seamlessly with Gymnasium environments.

Maintenance & Community

  • Described as a "weekend project" and "WIP" (Work In Progress).
  • Welcomes contributions.
  • No specific community links (Discord/Slack) or roadmap are provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. The repository's license file should be consulted for details.

Limitations & Caveats

Online RL convergence is notoriously difficult and requires hyperparameter tuning. The current implementation prioritizes simplicity over compute efficiency compared to frameworks like Lamorel. Supervised fine-tuning before RL is suggested but not yet implemented.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 100 stars in the last 90 days
