self-instruct by yizhongw

Self-Instruct: Research paper for aligning language models with self-generated instructions

Created 3 years ago
4,560 stars

Top 10.7% on SourcePulse

Project Summary

This repository provides the Self-Instruct framework and associated data for aligning pre-trained language models with instructions. The method improves instruction following without extensive manual annotation by using a model's own generations as instruction data. It is aimed at researchers and developers who want to improve LLM usability.

How It Works

Self-Instruct employs an iterative bootstrapping algorithm. It starts with a seed set of instructions, prompts a language model (like GPT-3) to generate new instructions and input-output instances, and then filters these generations for quality and diversity. The filtered data is added back to the pool, and the cycle repeats to build a large instruction-following dataset. This approach reduces reliance on costly manual data creation.
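
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `bootstrap`, `token_overlap`, and `fake_generate` are hypothetical names, the word-overlap (Jaccard) measure and the 0.7 threshold are illustrative choices for the diversity filter, and the model call is replaced by a canned stub. It covers only the instruction pool, not the generation of input-output instances.

```python
import random
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (an illustrative stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bootstrap(
    seed_tasks: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int = 3,
    prompt_size: int = 4,
    max_similarity: float = 0.7,  # assumed threshold, chosen for illustration
    rng_seed: int = 0,
) -> List[str]:
    """Grow an instruction pool by repeated generate-then-filter cycles."""
    rng = random.Random(rng_seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Sample a few pool instructions to serve as in-context examples.
        examples = rng.sample(pool, min(prompt_size, len(pool)))
        for candidate in generate(examples):
            candidate = candidate.strip()
            if not candidate:
                continue
            # Keep a candidate only if it is not too close to anything in the pool.
            if all(token_overlap(candidate, kept) < max_similarity for kept in pool):
                pool.append(candidate)
    return pool


def fake_generate(examples: List[str]) -> List[str]:
    """Hypothetical stand-in for a language model call; returns canned text."""
    return [
        examples[0],  # exact duplicate of a pool item, so the filter rejects it
        "Translate the sentence into French.",
        "List three prime numbers greater than ten.",
    ]


seeds = ["Summarize the following paragraph.", "Write a haiku about autumn."]
pool = bootstrap(seeds, fake_generate, rounds=2)
print(len(pool))
```

In a real run, `generate` would wrap a call to a language model, and the second round would see the enlarged pool when sampling its in-context examples. Here the stub returns the same candidates each round, so everything it proposes in round two is rejected as a duplicate.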

Quick Start & Requirements

  • Data Usage: The dataset is available at data/gpt3-generations/batch_221203/all_instances_82K.jsonl. A finetuning-ready version is at data/finetuning/self_instruct_221203.
  • Generation: Scripts are provided for generating data from scratch using the OpenAI API (tested with GPT-3). Key scripts include generate_instructions.sh, is_clf_or_not.sh, generate_instances.sh, and prepare_for_finetuning.sh.
  • Prerequisites: Requires access to the OpenAI API.
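
As a starting point for the data usage item above, the following sketch reads the released file as JSON Lines, which the `.jsonl` extension suggests. The helper name `load_jsonl` is hypothetical, and the code makes no assumption about which fields each record contains; it simply prints whatever keys the first record has. The path is the one given above and is resolved relative to the repository root.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def load_jsonl(path: Union[str, Path]) -> List[Dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Path taken from the listing above; assumes the current directory is the repo root.
DATA_PATH = Path("data/gpt3-generations/batch_221203/all_instances_82K.jsonl")

if DATA_PATH.exists():
    records = load_jsonl(DATA_PATH)
    print(len(records), "records loaded")
    if records:
        # Inspect the schema rather than assuming field names.
        print("fields:", sorted(records[0].keys()))
else:
    print("Dataset not found; run this from the repository root.")
```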

Highlighted Details

  • Released dataset contains 52k instructions and 82k instance inputs/outputs.
  • A separate dataset of 252 expert-written tasks is available for human evaluation.
  • The paper reports that 46% of generated data points may have issues and recommends cautious use and further filtering.

Maintenance & Community

The project is marked as "still in progress" with potential for code and data updates. No specific community channels or contributor details are listed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code and data are presented for research purposes, and users are encouraged to cite the associated paper. Terms for commercial use are not specified.

Limitations & Caveats

The data generation pipeline has been tested only with GPT-3 via the OpenAI API. The generated data may contain errors or biases: the paper reports that 46% of generated data points may have issues.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 23 stars in the last 30 days
