GPTeacher by teknium1

GPT-4 generated datasets for instruction tuning

created 2 years ago
1,639 stars

Top 26.2% on sourcepulse

Project Summary

GPTeacher provides a curated collection of modular datasets generated by GPT-4, designed for fine-tuning large language models. It targets researchers and developers aiming to improve instruction following, role-playing, coding, and tool usage capabilities in LLMs. The datasets offer diverse, high-quality examples to enhance model performance in these specific areas.

How It Works

The datasets are generated using GPT-4, adapting prompts similar to those used in projects like Stanford Alpaca, with specific enhancements for different task types. General-Instruct includes Chain of Thought reasoning and logic puzzles, while Code-Instruct offers programming tasks. Roleplay-Instruct focuses on character simulation, and Toolformer datasets are designed for tool integration. Each dataset is further split into similarity-cleaned subsets, allowing users to select varying levels of data purity.
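The repo does not document its exact cleaning procedure, but the idea behind a similarity-cleaned subset can be sketched as dropping examples whose instruction is a near-duplicate of one already kept. The function below is a minimal, illustrative O(n²) sketch using Python's standard-library `difflib`; the threshold value and the choice of comparing only the `instruction` field are assumptions, not the project's actual method.

```python
from difflib import SequenceMatcher

def similarity_filter(examples, threshold=0.9):
    """Keep only examples whose instruction is not a near-duplicate
    (similarity ratio >= threshold) of an already-kept instruction.
    Illustrative sketch; not the repo's documented cleaning script."""
    kept = []
    for ex in examples:
        if all(
            SequenceMatcher(None, ex["instruction"], k["instruction"]).ratio() < threshold
            for k in kept
        ):
            kept.append(ex)
    return kept

examples = [
    {"instruction": "List three primes.", "input": "", "output": "2, 3, 5"},
    {"instruction": "List three primes!", "input": "", "output": "2, 3, 5"},
    {"instruction": "Explain recursion.", "input": "", "output": "..."},
]
print(len(similarity_filter(examples)))  # the near-duplicate is dropped -> 2
```

Lowering the threshold yields a smaller, "purer" subset; raising it keeps more data, which matches the trade-off the dataset splits expose.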

Quick Start & Requirements

  • Datasets are available for download directly from the repository.
  • No specific installation commands are provided; usage typically involves downloading and integrating the data into existing fine-tuning pipelines.
  • Requires a Python environment for data processing and LLM fine-tuning.
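Since the records follow the Stanford Alpaca structure (instruction, input, output), integrating them into a fine-tuning pipeline typically means rendering each record into a single training string. A minimal sketch, assuming the standard Alpaca prompt template (the exact wording GPTeacher pipelines use may differ):

```python
def build_prompt(example):
    """Render one Alpaca-style record (instruction, input, output) into a
    training string. Template wording follows the common Alpaca convention."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n" + example["output"]
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:\n" + example["output"]
    )

record = {
    "instruction": "Translate to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
print(build_prompt(record))
```

Records with an empty `input` field fall through to the shorter template, which is the usual Alpaca-format convention.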

Highlighted Details

  • Includes General-Instruct (20k examples), Code-Instruct (~5,350 examples), and Roleplay-Instruct datasets.
  • Roleplay V2 is 2.5x larger and more diverse than the original, featuring simulated chat histories.
  • Datasets are formatted to be compliant with the Stanford Alpaca dataset structure (instruction, input, output).
  • Toolformer datasets support tools like search, Python, terminal, and Wikipedia.
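This summary does not document the exact serialization the Toolformer datasets use, but Toolformer-style data conventionally embeds inline tool calls such as `[Wikipedia('France')]` in the text. The syntax and helper below are assumptions for illustration only, showing how such calls might be extracted during post-processing:

```python
import re

# Hypothetical inline tool-call syntax, e.g. [Wikipedia('France')];
# the dataset's actual format is not documented in this summary.
TOOL_CALL = re.compile(r"\[(?P<tool>\w+)\('(?P<arg>[^']*)'\)\]")

def extract_tool_calls(text):
    """Return (tool_name, argument) pairs for each inline call found."""
    return [(m.group("tool"), m.group("arg")) for m in TOOL_CALL.finditer(text)]

sample = "The capital of France is [Wikipedia('France')] Paris."
print(extract_tool_calls(sample))  # [('Wikipedia', 'France')]
```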

Maintenance & Community

  • The project is maintained by teknium1.
  • No specific community links (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that prompt details for dataset generation are not fully provided, which may hinder reproducibility. The Code-Instruct dataset was initially delayed for cleaning, suggesting the datasets may still be undergoing refinement.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

