self-instruct by yizhongw

Self-Instruct: Research paper for aligning language models with self-generated instructions

Created 3 years ago

4,560 stars

Top 10.7% on SourcePulse


8 Experts Love This Project

Yaowei Zheng

Author of LLaMA-Factory

Pawel Garbacki

Cofounder of Fireworks AI

Luca Soldaini

Research Scientist at Ai2

Travis Fischer

Founder of Agentic

and 4 more!

Project Summary

This repository provides the Self-Instruct framework and associated data for aligning pre-trained language models with instructions. It enables instruction-following capabilities without extensive manual annotation by using a model's own generations to create instructional data, benefiting researchers and developers aiming to improve LLM usability.

How It Works

Self-Instruct employs an iterative bootstrapping algorithm. It starts with a seed set of instructions, prompts a language model (such as GPT-3) to generate new instructions and input-output instances, and then filters these generations for quality and diversity. The filtered data is added back to the pool, and the cycle repeats to build a large instruction-tuning dataset. This approach reduces reliance on costly manual data creation.
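The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: `bootstrap`, `jaccard`, and `fake_generate` are hypothetical names, the word-overlap filter and its 0.7 threshold are stand-ins for whatever quality and diversity filtering the real pipeline applies, and the generation of input-output instances is left out.

```python
import random
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (illustrative diversity metric, an assumption)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def bootstrap(
    seed: List[str],
    generate: Callable[[List[str]], List[str]],
    rounds: int = 3,
    num_examples: int = 2,
    max_similarity: float = 0.7,  # assumed threshold, chosen for illustration
    rng_seed: int = 0,
) -> List[str]:
    """Grow an instruction pool by generating candidates and filtering near-duplicates."""
    rng = random.Random(rng_seed)
    pool = list(seed)
    for _ in range(rounds):
        # Sample in-context examples from the current pool to prompt the model.
        examples = rng.sample(pool, min(num_examples, len(pool)))
        for candidate in generate(examples):
            candidate = candidate.strip()
            if not candidate:
                continue
            # Keep a candidate only if it is sufficiently different from the whole pool.
            if all(jaccard(candidate, kept) < max_similarity for kept in pool):
                pool.append(candidate)
    return pool


def fake_generate(examples: List[str]) -> List[str]:
    """Hypothetical stand-in for a model call; ignores its prompt and returns fixed text."""
    return [
        "Summarize the following paragraph.",
        "List three synonyms for the given word.",
    ]
```

Calling `bootstrap(seed, fake_generate)` with a small seed list shows the filter at work: candidates that repeat an instruction already in the pool are dropped, and only new ones are added. In practice `generate` would wrap a real model call.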

Quick Start & Requirements

Data Usage: The dataset is available at data/gpt3-generations/batch_221203/all_instances_82K.jsonl. A finetuning-ready version is at data/finetuning/self_instruct_221203.
Generation: Scripts are provided for generating data from scratch using the OpenAI API (tested with GPT-3). Key scripts include generate_instructions.sh, is_clf_or_not.sh, generate_instances.sh, and prepare_for_finetuning.sh.
Prerequisites: Requires access to the OpenAI API.
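For readers who only want to inspect the released data, the following is a small sketch of reading a JSON Lines file in Python. It assumes each non-empty line is one JSON object; the helper names `read_jsonl` and `summarize` are hypothetical, and the sketch makes no assumption about which fields the records contain, reporting whatever keys it finds.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: str) -> Dict[str, Any]:
    """Count records and how often each top-level key appears."""
    count = 0
    keys: Counter = Counter()
    for record in read_jsonl(path):
        count += 1
        keys.update(record.keys())
    return {"records": count, "keys": dict(keys)}
```

Pointing `summarize` at the dataset path listed above, after cloning the repository, gives a quick check of the record count and field names before any finetuning work.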

Highlighted Details

The released dataset contains 52k instructions and 82k instance inputs/outputs.
A separate set of 252 expert-written tasks is available for human evaluation.
The paper reports that 46% of generated data points may have issues, and the authors recommend cautious use and further filtering.

Maintenance & Community

The project is marked as "still in progress" with potential for code and data updates. No specific community channels or contributor details are listed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code and data are presented for research purposes, and users are encouraged to cite the associated paper. Terms for commercial use are not specified.

Limitations & Caveats

The data generation pipeline has been tested only with GPT-3 via the OpenAI API. The generated data may contain errors or biases: the paper reports that 46% of generated data points may have issues.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

23 stars in the last 30 days