self-instruct  by yizhongw

Self-Instruct: Code and data for the research paper on aligning language models with self-generated instructions

created 2 years ago
4,440 stars

Top 11.2% on sourcepulse

Project Summary

This repository provides the Self-Instruct framework and its associated data for aligning pre-trained language models with instructions. By using a model's own generations to create instruction data, it enables instruction-following capabilities without extensive manual annotation, which benefits researchers and developers aiming to improve LLM usability.

How It Works

Self-Instruct employs an iterative bootstrapping algorithm. Starting from a seed set of instructions, it prompts a language model (such as GPT-3) to generate new instructions and input-output instances, then filters these generations for quality and diversity. The filtered data is added back to the pool, and the cycle repeats to build a large dataset for instruction tuning. This approach reduces reliance on costly manual data creation.
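The loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the repository's code: `bootstrap`, `is_novel`, and `fake_generate` are hypothetical names, `fake_generate` stands in for the language-model call, and the diversity check uses `difflib.SequenceMatcher` from the standard library with an arbitrary 0.7 threshold as a stand-in for whatever filter the project applies.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def is_novel(candidate: str, pool: List[str], max_similarity: float = 0.7) -> bool:
    """Return True if the candidate is not too similar to any pooled instruction.

    Assumption: character-level similarity with a 0.7 cutoff is only a
    stand-in for the project's actual quality and diversity filter.
    """
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < max_similarity
        for existing in pool
    )


def bootstrap(
    seed_instructions: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int = 3,
    num_examples: int = 4,
    rng_seed: int = 0,
) -> List[str]:
    """Grow an instruction pool by repeated generate-then-filter cycles."""
    rng = random.Random(rng_seed)
    pool = list(seed_instructions)
    for _ in range(rounds):
        # Sample in-context examples from the current pool.
        examples = rng.sample(pool, min(num_examples, len(pool)))
        for candidate in generate(examples):
            candidate = candidate.strip()
            # Keep only non-empty candidates that add something new.
            if candidate and is_novel(candidate, pool):
                pool.append(candidate)
    return pool


def fake_generate(examples: List[str]) -> List[str]:
    """Hypothetical stand-in for a language-model call."""
    return ["Compute 17 * 23.", "   ", examples[0]]


if __name__ == "__main__":
    seeds = [
        "Translate the sentence into French.",
        "Summarize the paragraph in one sentence.",
    ]
    print(bootstrap(seeds, fake_generate, rounds=2))
```

With the stub generator, two rounds over the two seed instructions add only one new instruction, because the blank and duplicate candidates are filtered out. A real run would replace `fake_generate` with a call to a language model.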

Quick Start & Requirements

  • Data Usage: The dataset is available at data/gpt3-generations/batch_221203/all_instances_82K.jsonl. A finetuning-ready version is at data/finetuning/self_instruct_221203.
  • Generation: Scripts are provided for generating data from scratch using the OpenAI API (tested with GPT-3). Key scripts include generate_instructions.sh, is_clf_or_not.sh, generate_instances.sh, and prepare_for_finetuning.sh.
  • Prerequisites: Access to the OpenAI API.
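As a starting point for the data usage item above, the released JSONL file can be inspected with a few lines of standard-library Python. This is a sketch only: `iter_jsonl` and `summarize_fields` are hypothetical helper names, and the code makes no assumption about which fields each record contains; it simply counts the keys it finds.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterator


def iter_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_fields(path: Path) -> Counter:
    """Count how often each top-level key appears across all records."""
    counts: Counter = Counter()
    for record in iter_jsonl(path):
        counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Path as given in the summary above, relative to the repository root.
    data_path = Path("data/gpt3-generations/batch_221203/all_instances_82K.jsonl")
    if data_path.exists():
        print(summarize_fields(data_path))
```

Counting keys first is a cheap way to confirm the record layout before writing any filtering or finetuning code against it.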

Highlighted Details

  • The released dataset contains 52K instructions and 82K instance inputs/outputs.
  • A separate set of 252 expert-written tasks is available for human evaluation.
  • The paper reports that 46% of generated data points may have issues and recommends cautious use and further filtering.

Maintenance & Community

The project is marked as "still in progress" with potential for code and data updates. No specific community channels or contributor details are listed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code and data are presented for research purposes, and users are encouraged to cite the associated paper. Terms for commercial use are not specified.

Limitations & Caveats

The data generation pipeline has only been tested with GPT-3 via the OpenAI API. The generated data may contain errors or biases; the paper reports that 46% of data points may have issues.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 97 stars in the last 90 days
