Open-source tool for LLM fine-tuning dataset creation
This project provides an open-source implementation of the WizardLM method for generating complex instruction-following datasets, specifically tailored for fine-tuning Large Language Models (LLMs). It aims to enable the creation of ChatGPT-like models using entirely Apache 2.0 licensed components, avoiding TOS violations associated with datasets like ShareGPT. The primary audience is researchers and developers looking to build and fine-tune LLMs with high-quality, diverse instruction sets.
How It Works
The core approach involves taking an existing instruction-tuned LLM and using it to automatically rewrite simple "seed" prompts into more complex, multi-faceted instructions. This process leverages the "Evol-Instruct" methodology described in the linked paper, effectively increasing the complexity and richness of the training data. The advantage is creating sophisticated prompts that can lead to more capable and nuanced LLM responses, facilitating the development of more advanced AI assistants.
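The rewriting loop described above can be sketched in a few lines of Python. Everything here is illustrative rather than taken from the project's code: `llm_complete` is a hypothetical stand-in for a real call to an instruct-tuned model, and the `EVOLUTION_TEMPLATES` are simplified examples of the kinds of complexity-raising rewrites Evol-Instruct uses.

```python
import random
from typing import Optional

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a call to an instruct-tuned LLM.

    The real project would send the prompt to a model; this stub just
    appends a complexity-raising clause so the sketch runs without one.
    """
    instruction = prompt.split("Instruction:", 1)[1].strip()
    return instruction + " Explain your reasoning step by step."

# Rewriting templates in the spirit of Evol-Instruct: each asks the model
# to make the instruction harder in a specific way. (Illustrative only.)
EVOLUTION_TEMPLATES = [
    "Rewrite this to add one more constraint. Instruction: {seed}",
    "Rewrite this to require multi-step reasoning. Instruction: {seed}",
    "Rewrite this to demand a concrete worked example. Instruction: {seed}",
]

def evolve(seed: str, rounds: int = 2, rng: Optional[random.Random] = None) -> str:
    """Apply `rounds` randomly chosen evolution steps to a seed instruction."""
    rng = rng or random.Random(0)
    instruction = seed
    for _ in range(rounds):
        template = rng.choice(EVOLUTION_TEMPLATES)
        instruction = llm_complete(template.format(seed=instruction))
    return instruction
```

In the actual pipeline, each evolved instruction would then be answered by the model as well, producing instruction/response pairs for the fine-tuning dataset.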
Quick Start & Requirements
- Install dependencies: `pip install -r requirements.txt`
- Edit `wizardlm.py` to specify the base model and desired dataset size, then run `python wizardlm.py`
- The generated dataset is written to `.wizard_lm.<uuid>.json`
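The output file can be inspected with standard tooling. A minimal loading sketch, assuming the file holds a JSON list of records (the field names in the usage note below are illustrative; check the actual output for the real schema):

```python
import json

def load_dataset(path: str) -> list:
    """Load a generated dataset file, expected to be a JSON list of records."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON list in {path}, got {type(data).__name__}")
    return data
```

For example, `load_dataset(".wizard_lm.<uuid>.json")` would return the list of generated instruction/response records ready for a fine-tuning pipeline.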
Highlighted Details
Maintenance & Community
The project is hosted by h2oai. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.
Licensing & Compatibility
The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source systems.
Limitations & Caveats
The current implementation is noted as slow, even with optimizations. It requires a reasonably good instruct-tuned LLM for effective prompt generation, and generated responses can sometimes be empty. The project is actively under development with plans for speed improvements and enhanced response generation.