h2o-wizardlm by h2oai

Open-source tool for LLM fine-tuning dataset creation

Created 2 years ago · 312 stars · Top 87.5% on sourcepulse

Project Summary

This project provides an open-source implementation of the WizardLM method for generating complex instruction-following datasets, specifically tailored for fine-tuning Large Language Models (LLMs). It aims to enable the creation of ChatGPT-like models built entirely from Apache 2.0 licensed components, avoiding the terms-of-service violations associated with datasets such as ShareGPT. The primary audience is researchers and developers looking to build and fine-tune LLMs with high-quality, diverse instruction sets.

How It Works

The core approach takes an existing instruction-tuned LLM and uses it to automatically rewrite simple "seed" prompts into more complex, multi-faceted instructions. This process follows the "Evol-Instruct" methodology described in the WizardLM paper, increasing the complexity and richness of the training data. The advantage is sophisticated prompts that can lead to more capable and nuanced LLM responses, facilitating the development of more advanced AI assistants.
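The evolution loop can be sketched as follows. This is a minimal illustration of the Evol-Instruct idea, not the project's actual code: the mutation templates and the `llm` callable are assumptions, and `wizardlm.py` may structure its rewriting differently.

```python
import random

# Illustrative mutation templates in the spirit of Evol-Instruct;
# the actual prompts used by wizardlm.py may differ.
MUTATIONS = [
    "Add one or more constraints or requirements to: {seed}",
    "Rewrite the following as a multi-step reasoning task: {seed}",
    "Increase the depth and breadth of this instruction: {seed}",
]

def evolve(seed: str, llm, rounds: int = 2) -> str:
    """Repeatedly rewrite a seed prompt into a more complex instruction.

    `llm` is any callable mapping a meta-prompt to rewritten text,
    e.g. a thin wrapper around an instruct-tuned model.
    """
    prompt = seed
    for _ in range(rounds):
        meta = random.choice(MUTATIONS).format(seed=prompt)
        prompt = llm(meta)
    return prompt

# Stand-in "LLM" so the sketch runs without loading a model.
fake_llm = lambda meta: meta.upper()

print(evolve("Write a haiku about rivers.", fake_llm, rounds=1))
```

In practice each round feeds the previous round's output back in as the new seed, which is what compounds the complexity of the final instruction.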

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10.
  • Usage: Edit wizardlm.py to specify the base model and desired dataset size, then run python wizardlm.py.
  • Output: Generates a JSON file named wizard_lm.<uuid>.json.
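Since the output filename embeds a UUID, a glob is a convenient way to pick up the generated dataset. The JSON schema shown here is an assumption; inspect a generated file to confirm the actual field names.

```python
import glob
import json

# Find generated dataset files; the uuid in the name is unknown in
# advance, so match on the fixed prefix and extension.
paths = sorted(glob.glob("wizard_lm.*.json"))
if paths:
    with open(paths[-1]) as f:
        records = json.load(f)
    print(f"Loaded {len(records)} records from {paths[-1]}")
else:
    print("No wizard_lm.*.json files found; run wizardlm.py first.")
```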

Highlighted Details

  • Enables building ChatGPT-like models using Apache 2.0 licensed components.
  • Automates the creation of high-complexity instruction prompts from simpler seeds.
  • Aims for a complete Apache 2.0 fine-tuning loop (e.g., Open LLaMa + oasst1 -> WizardLM).

Maintenance & Community

The project is hosted by h2oai. Further community engagement details (Discord/Slack, roadmap) are not explicitly provided in the README.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source systems.

Limitations & Caveats

The current implementation is slow, even with optimizations. It requires a reasonably capable instruct-tuned LLM for effective prompt generation, and generated responses can sometimes be empty. The project is under active development, with plans for speed improvements and better response generation.
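Because generated responses can sometimes be empty, a simple post-filter before fine-tuning is prudent. This is a hypothetical cleanup step, not part of the project; the "output" field name is an assumption about the generated JSON schema.

```python
def filter_records(records: list[dict]) -> list[dict]:
    """Drop pairs whose response is empty or whitespace-only.

    The "output" field name is an assumption about the generated
    JSON; adjust to match the actual schema.
    """
    return [r for r in records if r.get("output", "").strip()]

sample = [
    {"instruction": "Explain recursion.", "output": "Recursion is..."},
    {"instruction": "Write a poem.", "output": "   "},  # empty response
]
print(len(filter_records(sample)))  # → 1
```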

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 90 days
