Pipeline for finetuning language models on GDScript code
This project provides a pipeline for finetuning large language models (LLMs) specifically for GDScript generation, a domain-specific language for the Godot game engine. It addresses the performance degradation of general-purpose LLMs on less common languages by creating specialized models trained on human-written GDScript code scraped from GitHub. This benefits game developers using Godot by offering more accurate and consistent code generation.
How It Works
The core methodology involves creating a dataset of comment:code pairs derived exclusively from MIT-licensed GDScript projects on GitHub. Unlike other approaches that use LLMs for dataset generation, this project uses LLMs (like GPT-3.5-turbo) solely for labeling human-written code snippets. This ensures the training data consists of authentic, human-created code, leading to more robust and syntactically correct GDScript generation.
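The labeling step can be sketched as follows. This is an illustrative sketch only: the record field names (`instruction`, `output`) and the helper function are assumptions, not the project's actual schema, and the LLM call that produces the comment is elided.

```python
import json

def make_training_record(comment: str, code: str) -> dict:
    """Pair an LLM-generated natural-language comment with the original
    human-written GDScript snippet as one training record.
    Field names are illustrative, not the project's schema."""
    return {
        "instruction": comment.strip(),
        "output": code.strip(),
    }

# A human-written snippet scraped from an MIT-licensed project, paired with
# a comment that an LLM such as GPT-3.5-turbo would generate for it.
snippet = 'func _ready():\n\tprint("Hello, Godot!")'
record = make_training_record(
    "Print a greeting when the node enters the scene tree.", snippet
)
print(json.dumps(record))
```

The key point is the direction of generation: the code side of each pair is always human-written, and the model only ever writes the comment side.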
Quick Start & Requirements
pip install -r requirements.txt
Highlighted Details
The dataset godot_dodo_4x_60k (approx. 60k rows) is available.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The primary weakness identified is a loss in verbosity and context awareness: models may assume external scope (for example, referencing variables or nodes never declared in the prompt), because the scraped code samples often depend on surrounding project context that is absent from the training pairs. A more sophisticated dataset construction could mitigate this.
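One way to see the external-scope issue concretely is a crude filter over snippets. The heuristic below is a toy sketch, not part of the project: it flags identifiers that are read but never declared inside a snippet, and it ignores function parameters and many GDScript constructs, so it only illustrates the caveat.

```python
import re

# Small, incomplete allowlists; real GDScript has far more builtins/keywords.
GDSCRIPT_BUILTINS = {"print", "self", "true", "false", "null", "range"}
KEYWORDS = {"var", "const", "func", "return", "if", "else", "for", "in",
            "pass", "extends"}

def undeclared_names(code: str) -> set:
    """Return lowercase identifiers used in the snippet but never declared
    via var/const/func. Deliberately crude: misses parameters, members, etc."""
    declared = set(re.findall(r"\b(?:var|const|func)\s+(\w+)", code))
    used = set(re.findall(r"\b([a-z_]\w*)\b", code))
    return used - declared - KEYWORDS - GDSCRIPT_BUILTINS

# This snippet reads velocity, direction, and dash_speed without declaring
# them -- a generated completion like this assumes external scope.
snippet = "func dash():\n\tvelocity = direction * dash_speed"
print(sorted(undeclared_names(snippet)))
```

A filter along these lines could be used during dataset construction to drop or down-weight snippets that lean heavily on undeclared context.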