Research paper for LLM instruction tuning via self-guided data selection
This repository provides a self-guided methodology for Large Language Models (LLMs) to autonomously select high-quality instruction-tuning data, reducing the need for manual curation. It introduces the Instruction-Following Difficulty (IFD) score, which lets an LLM filter "cherry data" based on its own perceived difficulty, benefiting researchers and practitioners who want to improve LLM performance efficiently.
How It Works
The core approach involves a three-phase process: "Learning from Brief Experience," where the LLM is exposed to a small subset of the data; "Evaluating Based on Experience," where the novel IFD score quantifies sample difficulty by comparing how well the model predicts a response with and without its instruction; and "Retraining from Self-Guided Experience," where the LLM is fine-tuned on cherry-picked data with high IFD scores. This self-supervised method avoids reliance on external models for data quality assessment.
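The IFD computation is straightforward once a causal language model is available. Below is a minimal sketch, assuming a Hugging Face model and an IFD defined as the ratio of the mean answer-token loss with the instruction to the loss without it; the model name and prompt handling here are placeholders, not the repository's actual setup.

```python
# Hedged sketch of an IFD-style score: how much harder is the response to
# predict when the instruction is prepended?
# "gpt2" is a stand-in model; the paper works with LLaMA-class models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_loss(prefix: str, answer: str) -> float:
    """Mean cross-entropy over the answer tokens, optionally conditioned on a prefix."""
    enc = tokenizer(prefix + answer, return_tensors="pt")
    prefix_len = len(tokenizer(prefix)["input_ids"]) if prefix else 0
    labels = enc["input_ids"].clone()
    labels[:, :prefix_len] = -100  # ignore the prefix; score only the answer
    with torch.no_grad():
        out = model(**enc, labels=labels)
    return out.loss.item()

def ifd_score(instruction: str, response: str) -> float:
    """Ratio of conditioned answer loss to direct answer loss."""
    return answer_loss(instruction, response) / answer_loss("", response)

print(ifd_score("Translate to French: Hello, world.", "Bonjour, le monde."))
```

A low IFD means the instruction already makes the response easy to predict (an easy sample), while a higher IFD flags harder, more informative samples; the cherry data are those with high IFD scores.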
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt (or manually install tqdm and scikit-learn). The cherry_seletion/data_analysis.py script is used for initial data processing and score calculation, followed by cherry_seletion/data_by_cluster.py or cherry_seletion/data_by_IFD.py for data selection.
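For orientation, here is a hedged sketch of what the IFD-based selection step amounts to, assuming each sample already carries a precomputed score under an illustrative "ifd" key; the file names and the 5% cutoff are examples, not the repository's defaults.

```python
# Hedged sketch of cherry-data selection: drop samples whose IFD is >= 1
# (the instruction did not help predict the response), then keep the hardest
# fraction by IFD. Field names and file paths are illustrative only.
import json

def select_cherry_data(samples, top_fraction=0.05):
    """Return the top `top_fraction` of valid samples, ranked by IFD."""
    valid = [s for s in samples if 0.0 < s["ifd"] < 1.0]
    valid.sort(key=lambda s: s["ifd"], reverse=True)
    keep = max(1, int(len(valid) * top_fraction))
    return valid[:keep]

with open("alpaca_with_ifd.json") as f:      # placeholder input
    data = json.load(f)

cherry = select_cherry_data(data, top_fraction=0.05)

with open("cherry_data.json", "w") as f:     # placeholder output
    json.dump(cherry, f, indent=2)
```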
Highlighted Details
Maintenance & Community
The project is associated with NAACL'24 and ACL'24 publications, indicating active research. Contact information for Ming Li is provided for questions.
Licensing & Compatibility
The repository does not explicitly state a license. The code is presented for research purposes, and commercial use would require clarification.
Limitations & Caveats
The README notes that, due to hardware limitations, reported results are based on 7B models, and that cherry_data_v2 based on LLaMA 2 was planned for release. The exact licensing status for commercial use is not specified.