GPT-4-generated datasets for instruction tuning
GPTeacher is a curated collection of modular datasets generated by GPT-4 for fine-tuning large language models. It targets researchers and developers who want to improve instruction following, role-playing, coding, and tool-use capabilities, offering diverse, high-quality examples for each of those areas.
How It Works
The datasets are generated with GPT-4, adapting prompts similar to those used in projects like Stanford Alpaca, with task-specific enhancements. General-Instruct includes Chain of Thought reasoning and logic puzzles, Code-Instruct offers programming tasks, Roleplay-Instruct focuses on character simulation, and the Toolformer datasets are designed for tool integration. Each dataset also ships in subsets cleaned at different similarity thresholds, letting users trade dataset size against deduplication strictness. A minimal loading sketch follows below.
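As a rough illustration of how one of these JSON datasets might be loaded and rendered into Alpaca-style training prompts, here is a minimal Python sketch. The file path is hypothetical, and the field names (instruction, input, response) are assumed from the Alpaca-like format the project describes; verify both against the actual files in the repo before training.

```python
import json

# Hypothetical path: actual filenames vary by dataset and similarity threshold.
DATASET_PATH = "GPTeacher/Instruct/gpt4-instruct-similarity-0.9-dataset.json"

def format_example(example: dict) -> str:
    """Render one record as an Alpaca-style training prompt.

    Assumes Alpaca-like keys (instruction / input / response);
    check the actual JSON keys in the repo before relying on this.
    """
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['response']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )

with open(DATASET_PATH, encoding="utf-8") as f:
    records = json.load(f)  # each dataset is a JSON list of examples

print(f"{len(records)} examples loaded")
print(format_example(records[0]))
```

Rendered prompts in this form can then be fed to any standard supervised fine-tuning pipeline.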
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README notes that the exact prompts used to generate the datasets are not fully documented, which may hinder reproducibility. The Code-Instruct dataset's release was initially delayed for cleaning, suggesting ongoing refinement.