godot-dodo by minosvasilias

Pipeline for finetuning language models on GDScript code

created 2 years ago
554 stars

Top 58.7% on sourcepulse

Project Summary

This project provides a pipeline for finetuning large language models (LLMs) specifically for GDScript generation, a domain-specific language for the Godot game engine. It addresses the performance degradation of general-purpose LLMs on less common languages by creating specialized models trained on human-written GDScript code scraped from GitHub. This benefits game developers using Godot by offering more accurate and consistent code generation.

How It Works

The core methodology involves creating a dataset of comment:code pairs derived exclusively from MIT-licensed GDScript projects on GitHub. Unlike other approaches that use LLMs for dataset generation, this project uses LLMs (like GPT-3.5-turbo) solely for labeling human-written code snippets. This ensures the training data consists of authentic, human-created code, leading to more robust and syntactically correct GDScript generation.
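
As a rough illustration of that labeling step, the sketch below pairs a human-written GDScript snippet with an LLM-generated instruction. It assumes the official openai Python client; the prompt wording, the label_snippet helper, and the example snippet are illustrative, not the repository's actual code.

    import os
    from openai import OpenAI

    # Assumes OPENAI_API_KEY is set in the environment.
    client = OpenAI()

    def label_snippet(gdscript_function: str) -> str:
        """Ask GPT-3.5-turbo for a short instruction describing what a
        human-written GDScript function does."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Describe the given GDScript function as a short, "
                            "imperative instruction."},
                {"role": "user", "content": gdscript_function},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content.strip()

    # Each dataset row keeps the human-written code untouched and stores
    # the generated description as the instruction.
    snippet = "func heal(amount):\n\thealth = min(health + amount, max_health)"
    row = {"instruction": label_snippet(snippet), "output": snippet}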

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Dataset generation requires GitHub and OpenAI API keys (see the pre-flight sketch after this list).
  • Finetuning requires significant GPU resources (e.g., multiple A100 80GB GPUs recommended for LLaMA models).
  • Inference demo available via a Google Colab notebook.
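
Before running dataset generation, both API keys need to be available to the scripts. A minimal pre-flight check might look like the following; the environment variable names GITHUB_TOKEN and OPENAI_API_KEY are assumptions rather than what the repository's scripts necessarily expect.

    import os

    # Hypothetical variable names; the scripts may instead read keys from a
    # config file or command-line arguments.
    REQUIRED_KEYS = ("GITHUB_TOKEN", "OPENAI_API_KEY")

    missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
    if missing:
        raise SystemExit("Set these environment variables first: " + ", ".join(missing))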

Highlighted Details

  • Finetuned models demonstrate significantly higher consistency in GDScript syntax compared to GPT-4/3.5-turbo.
  • Variants trained on code-specific base models can outperform general models on complex instructions.
  • Dataset generation involves scraping GitHub, identifying GDScript files, splitting them into functions, and generating comments using an LLM (illustrated by the sketch after this list).
  • Pre-assembled dataset godot_dodo_4x_60k (approx. 60k rows) is available.
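
To illustrate the splitting step mentioned above, a naive approach is to treat each top-level func definition plus its indented body as one snippet. This indentation-based sketch is an assumption about how such splitting could work, not the repository's implementation.

    import re

    def split_gdscript_functions(source: str) -> list[str]:
        """Naively split a GDScript file into top-level functions by
        collecting each 'func ...' line together with its indented body."""
        functions, current = [], []
        for line in source.splitlines():
            if re.match(r"^(static\s+)?func\s+\w+", line):
                if current:
                    functions.append("\n".join(current).rstrip())
                current = [line]
            elif current and (line.startswith(("\t", " ")) or not line.strip()):
                current.append(line)
            elif current:
                functions.append("\n".join(current).rstrip())
                current = []
        if current:
            functions.append("\n".join(current).rstrip())
        return functions

Each resulting snippet would then be passed to a labeling step like the one sketched under "How It Works".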

Maintenance & Community

  • Project is authored by Markus Sobkowski.
  • Acknowledgements credit the MIT-licensed Godot projects used as data sources and fluidstack.io for providing GPU instances.
  • Citation details are provided for the project, LLaMA, and Stanford Alpaca.

Licensing & Compatibility

  • Dataset is exclusively sourced from MIT-licensed GitHub repositories.
  • No explicit license is stated for the repository's code itself, although the exclusively MIT-licensed data sourcing suggests a permissive intent.

Limitations & Caveats

The primary weakness identified is reduced verbosity and context awareness: because training samples are individual functions scraped out of larger projects, models may assume external scope (such as variables or nodes defined elsewhere). More sophisticated dataset construction could mitigate this.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
