GPTeacher by teknium1

GPT-4 generated datasets for instruction tuning

created 2 years ago
1,639 stars

Top 26.2% on sourcepulse

Project Summary

GPTeacher provides a curated collection of modular datasets generated by GPT-4, designed for fine-tuning large language models. It targets researchers and developers aiming to improve instruction following, role-playing, coding, and tool usage capabilities in LLMs. The datasets offer diverse, high-quality examples to enhance model performance in these specific areas.

How It Works

The datasets are generated using GPT-4, adapting prompts similar to those used in projects like Stanford Alpaca, with specific enhancements for different task types. General-Instruct includes Chain of Thought reasoning and logic puzzles, while Code-Instruct offers programming tasks. Roleplay-Instruct focuses on character simulation, and Toolformer datasets are designed for tool integration. Each dataset is further split into similarity-cleaned subsets, allowing users to select varying levels of data purity.
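The repo does not document its exact cleaning procedure, but the idea behind a similarity-cleaned subset can be sketched as dropping examples whose instruction is a near-duplicate of one already kept. The function below is a minimal, illustrative O(n²) sketch using Python's standard-library `difflib`; the threshold value and the choice of comparing only the `instruction` field are assumptions, not the project's actual method.

```python
from difflib import SequenceMatcher

def similarity_filter(examples, threshold=0.9):
    """Keep only examples whose instruction is not a near-duplicate
    (similarity ratio >= threshold) of an already-kept instruction.
    Illustrative sketch; not the repo's documented cleaning script."""
    kept = []
    for ex in examples:
        if all(
            SequenceMatcher(None, ex["instruction"], k["instruction"]).ratio() < threshold
            for k in kept
        ):
            kept.append(ex)
    return kept

examples = [
    {"instruction": "List three primes.", "input": "", "output": "2, 3, 5"},
    {"instruction": "List three primes!", "input": "", "output": "2, 3, 5"},
    {"instruction": "Explain recursion.", "input": "", "output": "..."},
]
print(len(similarity_filter(examples)))  # the near-duplicate is dropped -> 2
```

Lowering the threshold yields a smaller, "purer" subset; raising it keeps more data, which matches the trade-off the dataset splits expose.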

Quick Start & Requirements

  • Datasets are available for download directly from the repository.
  • No specific installation commands are provided; usage typically involves downloading and integrating the data into existing fine-tuning pipelines.
  • Requires a Python environment for data processing and LLM fine-tuning.
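Since the records follow the Stanford Alpaca structure (instruction, input, output), integrating them into a fine-tuning pipeline typically means rendering each record into a single training string. A minimal sketch, assuming the standard Alpaca prompt template (the exact wording GPTeacher pipelines use may differ):

```python
def build_prompt(example):
    """Render one Alpaca-style record (instruction, input, output) into a
    training string. Template wording follows the common Alpaca convention."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n" + example["output"]
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        "### Response:\n" + example["output"]
    )

record = {
    "instruction": "Translate to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}
print(build_prompt(record))
```

Records with an empty `input` field fall through to the shorter template, which is the usual Alpaca-format convention.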

Highlighted Details

  • Includes General-Instruct (20k examples), Code-Instruct (~5,350 examples), and Roleplay-Instruct datasets.
  • Roleplay V2 is 2.5x larger and more diverse than the original, featuring simulated chat histories.
  • Datasets are formatted to be compliant with the Stanford Alpaca dataset structure (instruction, input, output).
  • Toolformer datasets support tools like search, Python, terminal, and Wikipedia.
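This summary does not document the exact serialization the Toolformer datasets use, but Toolformer-style data conventionally embeds inline tool calls such as `[Wikipedia('France')]` in the text. The syntax and helper below are assumptions for illustration only, showing how such calls might be extracted during post-processing:

```python
import re

# Hypothetical inline tool-call syntax, e.g. [Wikipedia('France')];
# the dataset's actual format is not documented in this summary.
TOOL_CALL = re.compile(r"\[(?P<tool>\w+)\('(?P<arg>[^']*)'\)\]")

def extract_tool_calls(text):
    """Return (tool_name, argument) pairs for each inline call found."""
    return [(m.group("tool"), m.group("arg")) for m in TOOL_CALL.finditer(text)]

sample = "The capital of France is [Wikipedia('France')] Paris."
print(extract_tool_calls(sample))  # [('Wikipedia', 'France')]
```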

Maintenance & Community

  • The project is maintained by teknium1.
  • No specific community links (Discord, Slack) or roadmap are mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions that prompt details for dataset generation are not fully provided, which may hinder reproducibility. The Code-Instruct dataset was initially delayed for cleaning, suggesting the datasets may still be undergoing refinement.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 12 stars in the last 90 days

