godot-dodo by minosvasilias

Pipeline for finetuning language models on GDScript code

created 2 years ago
554 stars

Top 58.7% on sourcepulse

Project Summary

This project provides a pipeline for finetuning large language models (LLMs) specifically for GDScript generation, a domain-specific language for the Godot game engine. It addresses the performance degradation of general-purpose LLMs on less common languages by creating specialized models trained on human-written GDScript code scraped from GitHub. This benefits game developers using Godot by offering more accurate and consistent code generation.

How It Works

The core methodology involves creating a dataset of comment:code pairs derived exclusively from MIT-licensed GDScript projects on GitHub. Unlike other approaches that use LLMs for dataset generation, this project uses LLMs (like GPT-3.5-turbo) solely for labeling human-written code snippets. This ensures the training data consists of authentic, human-created code, leading to more robust and syntactically correct GDScript generation.
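
As a rough illustration of that labeling step, the sketch below pairs a human-written GDScript snippet with an LLM-generated instruction. It assumes the official openai Python client; the prompt wording, the label_snippet helper, and the example snippet are illustrative, not the repository's actual code.

    import os
    from openai import OpenAI

    # Assumes OPENAI_API_KEY is set in the environment.
    client = OpenAI()

    def label_snippet(gdscript_function: str) -> str:
        """Ask GPT-3.5-turbo for a short instruction describing what a
        human-written GDScript function does."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Describe the given GDScript function as a short, "
                            "imperative instruction."},
                {"role": "user", "content": gdscript_function},
            ],
            temperature=0.2,
        )
        return response.choices[0].message.content.strip()

    # Each dataset row keeps the human-written code untouched and stores
    # the generated description as the instruction.
    snippet = "func heal(amount):\n\thealth = min(health + amount, max_health)"
    row = {"instruction": label_snippet(snippet), "output": snippet}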

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Dataset generation requires GitHub and OpenAI API keys (see the pre-flight sketch after this list).
  • Finetuning requires significant GPU resources (e.g., multiple A100 80GB GPUs recommended for LLaMA models).
  • Inference demo available via a Google Colab notebook.
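
Before running dataset generation, both API keys need to be available to the scripts. A minimal pre-flight check might look like the following; the environment variable names GITHUB_TOKEN and OPENAI_API_KEY are assumptions rather than what the repository's scripts necessarily expect.

    import os

    # Hypothetical variable names; the scripts may instead read keys from a
    # config file or command-line arguments.
    REQUIRED_KEYS = ("GITHUB_TOKEN", "OPENAI_API_KEY")

    missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
    if missing:
        raise SystemExit("Set these environment variables first: " + ", ".join(missing))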

Highlighted Details

  • Finetuned models demonstrate significantly higher consistency in GDScript syntax compared to GPT-4/3.5-turbo.
  • Variants trained on code-specific base models can outperform general models on complex instructions.
  • Dataset generation involves scraping GitHub, identifying GDScript files, splitting them into functions, and generating comments using an LLM (illustrated by the sketch after this list).
  • Pre-assembled dataset godot_dodo_4x_60k (approx. 60k rows) is available.
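
To illustrate the splitting step mentioned above, a naive approach is to treat each top-level func definition plus its indented body as one snippet. This indentation-based sketch is an assumption about how such splitting could work, not the repository's implementation.

    import re

    def split_gdscript_functions(source: str) -> list[str]:
        """Naively split a GDScript file into top-level functions by
        collecting each 'func ...' line together with its indented body."""
        functions, current = [], []
        for line in source.splitlines():
            if re.match(r"^(static\s+)?func\s+\w+", line):
                if current:
                    functions.append("\n".join(current).rstrip())
                current = [line]
            elif current and (line.startswith(("\t", " ")) or not line.strip()):
                current.append(line)
            elif current:
                functions.append("\n".join(current).rstrip())
                current = []
        if current:
            functions.append("\n".join(current).rstrip())
        return functions

Each resulting snippet would then be passed to a labeling step like the one sketched under "How It Works".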

Maintenance & Community

  • Project is authored by Markus Sobkowski.
  • Acknowledgements credit the MIT-licensed Godot projects used as data sources and fluidstack.io for providing GPU instances.
  • Citation details are provided for the project, LLaMA, and Stanford Alpaca.

Licensing & Compatibility

  • Dataset is exclusively sourced from MIT-licensed GitHub repositories.
  • No explicit license is stated for the repository's code itself, although the exclusively MIT-licensed data sourcing suggests a permissive intent.

Limitations & Caveats

The primary weakness identified is reduced verbosity and context awareness: because training samples are individual functions scraped out of larger projects, models may assume external scope (such as variables or nodes defined elsewhere). More sophisticated dataset construction could mitigate this.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 7 stars in the last 90 days
